Title: Personal Loan Campaign¶

Overview¶

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service translates into lost revenue, so the bank wants to analyze customer data to identify which customers are likely to leave its credit card services and why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not give up their credit cards.

You need to identify the best possible model that delivers the required performance.

Objective¶

1. Explore and visualize the dataset.
2. Build a classification model to predict whether a customer will churn.
3. Optimize the model using appropriate techniques.
4. Generate a set of insights and recommendations that will help the bank.
5. Predict which customers would leave the bank.
6. Determine which variables are most significant.
7. Determine which segment of customers should be highlighted.

Questions¶

- Are there identifiable trends?
- What is the customer profile?
- Which customers have a higher probability of leaving the bank?
- Which variables cause those customers to leave?

Data Dictionary¶

-Attrition_Flag is the dependent variable

  • CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  • Customer_Age: Age in Years
  • Gender: Gender of the account holder
  • Dependent_count: Number of dependents
  • Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
  • Marital_Status: Marital Status of the account holder
  • Income_Category: Annual Income Category of the account holder
  • Card_Category: Type of Card
  • Months_on_book: Period of relationship with the bank
  • Total_Relationship_Count: Total no. of products held by the customer
  • Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  • Credit_Limit: Credit Limit on the Credit Card
  • Total_Revolving_Bal: Total revolving balance on the credit card (the balance that carries over from one month to the next)
  • Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  • Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
  • Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
  • Avg_Utilization_Ratio: Represents how much of the available credit the customer spent
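The relationships in the dictionary can be sanity-checked in code. The sketch below is a minimal example on made-up values (not rows from BankChurners.csv): it verifies that Avg_Open_To_Buy equals Credit_Limit minus Total_Revolving_Bal, and shows one common way to encode the dependent variable Attrition_Flag as a binary target.

```python
import pandas as pd
import numpy as np

# Small synthetic sample mirroring the dictionary above (illustrative values,
# not actual rows from BankChurners.csv).
sample = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Credit_Limit": [12691.0, 1438.3, 3418.0],
    "Total_Revolving_Bal": [777, 808, 0],
    "Avg_Open_To_Buy": [11914.0, 630.3, 3418.0],
})

# The dictionary implies Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal.
assert np.allclose(sample["Credit_Limit"] - sample["Total_Revolving_Bal"],
                   sample["Avg_Open_To_Buy"])

# Encode the target: 1 = churned ("Attrited Customer"), 0 = retained.
sample["Attrition"] = (sample["Attrition_Flag"] == "Attrited Customer").astype(int)
print(sample["Attrition"].tolist())  # -> [0, 1, 0]
```

Because Avg_Open_To_Buy is a linear combination of two other columns, it is worth watching for multicollinearity when the model is built.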

Libraries¶

In [110]:
!pip install imblearn
Requirement already satisfied: imblearn in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (0.0)
Requirement already satisfied: imbalanced-learn in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imblearn) (0.9.1)
Requirement already satisfied: scikit-learn>=1.1.0 in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.1.3)
Requirement already satisfied: joblib>=1.0.0 in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.1.0)
Requirement already satisfied: scipy>=1.3.2 in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.7.3)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (2.2.0)
Requirement already satisfied: numpy>=1.17.3 in /Users/rcs713/opt/anaconda3/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.21.5)
In [111]:
### IMPORT: ------------------------------------
import scipy.stats as stats
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings

import statsmodels.api as sm

# --Sklearn library--
# Data splitting, cross-validation, and hyperparameter tuning
from sklearn.model_selection import (
    train_test_split,
    StratifiedKFold,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
)

from sklearn import metrics, tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
    StackingClassifier,
)

from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

# To impute missing values
from sklearn.impute import KNNImputer

# Metric scores
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
# To plot the confusion matrix (plot_confusion_matrix is deprecated; prefer ConfusionMatrixDisplay)
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, plot_confusion_matrix

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Resampling utilities for the imbalanced target
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_colwidth', 200)
# To suppress numerical display in scientific notation
pd.set_option('display.float_format', lambda x: '%.5f' % x)
warnings.filterwarnings('ignore')  # To suppress warnings
# Set the background style for the graphs
plt.style.use('ggplot')
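SMOTE and RandomUnderSampler are imported because churn datasets typically have far fewer attrited than existing customers. Before any resampling, the split itself should be stratified so the class ratio is preserved in both partitions; resampling is then applied to the training set only. A minimal sketch on synthetic data, assuming a roughly 16% positive rate purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 16% positives (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 160 + [0] * 840)

# stratify=y keeps the class ratio identical in train and test partitions,
# which matters before applying SMOTE/undersampling to the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # -> 0.16 0.16
```

Without `stratify=y`, a random split could leave the test set with a noticeably different churn rate, distorting the evaluation metrics.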

Understanding the data¶

In [112]:
# Reading the CSV file BankChurners.csv
data = pd.read_csv("BankChurners.csv")

df = data
dfc = df.copy()
print(f'There are {dfc.shape[0]} rows and {dfc.shape[1]} columns')  # f-string
There are 10127 rows and 21 columns
In [113]:
# View the first 100 rows of the dataset.
dfc.head(100)
Out[113]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.00000 777 11914.00000 1.33500 1144 42 1.62500 0.06100
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.00000 864 7392.00000 1.54100 1291 33 3.71400 0.10500
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.00000 0 3418.00000 2.59400 1887 20 2.33300 0.00000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.00000 2517 796.00000 1.40500 1171 20 2.33300 0.76000
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.00000 0 4716.00000 2.17500 816 28 2.50000 0.00000
5 713061558 Existing Customer 44 M 2 Graduate Married $40K - $60K Blue 36 3 1 2 4010.00000 1247 2763.00000 1.37600 1088 24 0.84600 0.31100
6 810347208 Existing Customer 51 M 4 NaN Married $120K + Gold 46 6 1 3 34516.00000 2264 32252.00000 1.97500 1330 31 0.72200 0.06600
7 818906208 Existing Customer 32 M 0 High School NaN $60K - $80K Silver 27 2 2 2 29081.00000 1396 27685.00000 2.20400 1538 36 0.71400 0.04800
8 710930508 Existing Customer 37 M 3 Uneducated Single $60K - $80K Blue 36 5 2 0 22352.00000 2517 19835.00000 3.35500 1350 24 1.18200 0.11300
9 719661558 Existing Customer 48 M 2 Graduate Single $80K - $120K Blue 36 6 3 3 11656.00000 1677 9979.00000 1.52400 1441 32 0.88200 0.14400
10 708790833 Existing Customer 42 M 5 Uneducated NaN $120K + Blue 31 5 3 2 6748.00000 1467 5281.00000 0.83100 1201 42 0.68000 0.21700
11 710821833 Existing Customer 65 M 1 NaN Married $40K - $60K Blue 54 6 2 3 9095.00000 1587 7508.00000 1.43300 1314 26 1.36400 0.17400
12 710599683 Existing Customer 56 M 1 College Single $80K - $120K Blue 36 3 6 0 11751.00000 0 11751.00000 3.39700 1539 17 3.25000 0.00000
13 816082233 Existing Customer 35 M 3 Graduate NaN $60K - $80K Blue 30 5 1 3 8547.00000 1666 6881.00000 1.16300 1311 33 2.00000 0.19500
14 712396908 Existing Customer 57 F 2 Graduate Married Less than $40K Blue 48 5 2 2 2436.00000 680 1756.00000 1.19000 1570 29 0.61100 0.27900
15 714885258 Existing Customer 44 M 4 NaN NaN $80K - $120K Blue 37 5 1 2 4234.00000 972 3262.00000 1.70700 1348 27 1.70000 0.23000
16 709967358 Existing Customer 48 M 4 Post-Graduate Single $80K - $120K Blue 36 6 2 3 30367.00000 2362 28005.00000 1.70800 1671 27 0.92900 0.07800
17 753327333 Existing Customer 41 M 3 NaN Married $80K - $120K Blue 34 4 4 1 13535.00000 1291 12244.00000 0.65300 1028 21 1.62500 0.09500
18 806160108 Existing Customer 61 M 1 High School Married $40K - $60K Blue 56 2 2 3 3193.00000 2517 676.00000 1.83100 1336 30 1.14300 0.78800
19 709327383 Existing Customer 45 F 2 Graduate Married abc Blue 37 6 1 2 14470.00000 1157 13313.00000 0.96600 1207 21 0.90900 0.08000
20 806165208 Existing Customer 47 M 1 Doctorate Divorced $60K - $80K Blue 42 5 2 0 20979.00000 1800 19179.00000 0.90600 1178 27 0.92900 0.08600
21 708508758 Attrited Customer 62 F 0 Graduate Married Less than $40K Blue 49 2 3 3 1438.30000 0 1438.30000 1.04700 692 16 0.60000 0.00000
22 784725333 Existing Customer 41 M 3 High School Married $40K - $60K Blue 33 4 2 1 4470.00000 680 3790.00000 1.60800 931 18 1.57100 0.15200
23 811604133 Existing Customer 47 F 4 NaN Single Less than $40K Blue 36 3 3 2 2492.00000 1560 932.00000 0.57300 1126 23 0.35300 0.62600
24 789124683 Existing Customer 54 M 2 NaN Married $80K - $120K Blue 42 4 2 3 12217.00000 0 12217.00000 1.07500 1110 21 0.75000 0.00000
25 771071958 Existing Customer 41 F 3 Graduate Single Less than $40K Blue 28 6 1 2 7768.00000 1669 6099.00000 0.79700 1051 22 0.83300 0.21500
26 720466383 Existing Customer 59 M 1 High School NaN $40K - $60K Blue 46 4 1 2 14784.00000 1374 13410.00000 0.92100 1197 23 1.30000 0.09300
27 804424383 Existing Customer 63 M 1 NaN Married $60K - $80K Blue 56 3 3 2 10215.00000 1010 9205.00000 0.84300 1904 40 1.00000 0.09900
28 718813833 Existing Customer 44 F 3 Uneducated Single abc Blue 34 5 2 2 10100.00000 0 10100.00000 0.52500 1052 18 1.57100 0.00000
29 806624208 Existing Customer 47 M 4 High School Married $40K - $60K Blue 42 6 0 0 4785.00000 1362 3423.00000 0.73900 1045 38 0.90000 0.28500
30 778348233 Existing Customer 53 M 3 NaN Married $80K - $120K Blue 33 3 2 3 2753.00000 1811 942.00000 0.97700 1038 25 2.57100 0.65800
31 712991808 Existing Customer 53 M 2 Uneducated Married $60K - $80K Blue 48 2 5 1 2451.00000 1690 761.00000 1.32300 1596 26 1.60000 0.69000
32 709029408 Existing Customer 41 M 4 Graduate Married $60K - $80K Blue 36 4 1 2 8923.00000 2517 6406.00000 1.72600 1589 24 1.66700 0.28200
33 788658483 Existing Customer 53 F 2 College Married Less than $40K Blue 38 5 2 3 2650.00000 1490 1160.00000 1.75000 1411 28 1.00000 0.56200
34 787937058 Existing Customer 58 M 0 Graduate Married $80K - $120K Blue 49 6 2 2 12555.00000 1696 10859.00000 0.51900 1291 24 0.71400 0.13500
35 715318008 Existing Customer 55 F 1 College Single Less than $40K Blue 36 4 2 1 3520.00000 1914 1606.00000 0.51000 1407 43 0.48300 0.54400
36 713962233 Existing Customer 55 F 3 Graduate Married Less than $40K Blue 36 6 2 3 3035.00000 2298 737.00000 1.72400 1877 37 1.17600 0.75700
37 785432733 Existing Customer 42 F 4 High School Married Less than $40K Gold 36 2 3 3 15433.00000 0 15433.00000 0.86500 966 22 1.20000 0.00000
38 715190283 Existing Customer 57 F 1 Graduate NaN $40K - $60K Blue 49 3 3 2 3672.00000 886 2786.00000 1.32000 1464 28 0.55600 0.24100
39 708300483 Attrited Customer 66 F 0 Doctorate Married abc Blue 56 5 4 3 7882.00000 605 7277.00000 1.05200 704 16 0.14300 0.07700
40 827111283 Existing Customer 45 M 3 Graduate Single $80K - $120K Blue 41 2 2 2 32426.00000 578 31848.00000 1.04200 1109 28 0.47400 0.01800
41 758551608 Existing Customer 51 M 2 NaN Married $40K - $60K Blue 44 4 1 0 6205.00000 2204 4001.00000 0.80300 1347 28 0.55600 0.35500
42 773146383 Existing Customer 50 F 1 College Single $40K - $60K Silver 43 3 2 3 17304.00000 2517 14787.00000 1.44900 1756 33 1.20000 0.14500
43 778493808 Existing Customer 49 M 3 High School Married $60K - $80K Blue 37 5 2 1 3906.00000 0 3906.00000 1.21400 1756 32 1.00000 0.00000
44 720572508 Existing Customer 38 F 4 Graduate Single abc Blue 28 2 3 3 9830.00000 2055 7775.00000 0.97700 1042 23 0.91700 0.20900
45 712661433 Existing Customer 49 M 4 Uneducated Single $80K - $120K Blue 30 3 2 3 34516.00000 0 34516.00000 1.62100 1444 28 1.33300 0.00000
46 789172683 Existing Customer 56 M 2 Doctorate Married $60K - $80K Blue 45 6 2 0 2283.00000 1430 853.00000 2.31600 1741 27 0.58800 0.62600
47 738406533 Existing Customer 59 M 1 Doctorate Married $40K - $60K Blue 52 3 2 2 2548.00000 2020 528.00000 2.35700 1719 27 1.70000 0.79300
48 799723908 Existing Customer 46 M 3 High School Married $80K - $120K Blue 40 4 3 3 19458.00000 1435 18023.00000 0.78700 1217 27 0.80000 0.07400
49 771490833 Existing Customer 52 M 1 College Single $80K - $120K Blue 40 5 1 1 4745.00000 1227 3518.00000 0.62400 1140 40 0.60000 0.25900
50 720756708 Existing Customer 52 F 3 NaN Married Less than $40K Blue 41 6 3 2 2622.00000 1549 1073.00000 1.32100 1878 30 1.14300 0.59100
51 779471883 Attrited Customer 54 F 1 Graduate Married Less than $40K Blue 40 2 3 1 1438.30000 808 630.30000 0.99700 705 19 0.90000 0.56200
52 711525033 Existing Customer 66 F 0 High School Married Less than $40K Blue 54 3 4 2 3171.00000 2179 992.00000 1.22400 1946 38 1.92300 0.68700
53 712813458 Existing Customer 49 M 2 NaN Married $120K + Blue 36 4 2 0 19763.00000 2517 17246.00000 0.66400 1414 35 0.25000 0.12700
54 714374133 Attrited Customer 56 M 2 Graduate Married $120K + Blue 36 1 3 3 15769.00000 0 15769.00000 1.04100 602 15 0.36400 0.00000
55 717891558 Existing Customer 49 F 4 Graduate NaN Less than $40K Blue 36 6 4 2 3298.00000 2200 1098.00000 0.67800 1052 32 0.60000 0.66700
56 716632758 Existing Customer 49 F 3 Graduate Single Less than $40K Blue 36 2 2 0 2802.00000 2363 439.00000 0.75000 1295 40 0.60000 0.84300
57 768563658 Existing Customer 56 M 2 Uneducated Married $40K - $60K Blue 50 4 2 3 4458.00000 1880 2578.00000 1.10700 1424 29 1.41700 0.42200
58 711427458 Existing Customer 44 F 5 Graduate Married abc Blue 35 4 1 2 6273.00000 978 5295.00000 2.27500 1359 25 1.08300 0.15600
59 714091983 Existing Customer 42 M 2 High School Single $60K - $80K Blue 34 4 4 3 3336.00000 1753 1583.00000 0.69000 1168 27 1.25000 0.52500
60 787584108 Existing Customer 55 M 3 NaN Married $80K - $120K Blue 47 4 2 3 3436.00000 2016 1420.00000 0.90100 1097 33 0.83300 0.58700
61 712030833 Attrited Customer 48 M 2 Graduate Married $60K - $80K Silver 35 2 4 4 34516.00000 0 34516.00000 0.76300 691 15 0.50000 0.00000
62 711481533 Existing Customer 39 M 1 High School Divorced $60K - $80K Blue 33 5 3 3 5926.00000 1251 4675.00000 0.94400 1316 28 1.15400 0.21100
63 710082708 Existing Customer 44 M 4 Post-Graduate Single $120K + Blue 32 2 4 2 23957.00000 2102 21855.00000 0.99700 1276 26 0.73300 0.08800
64 708155733 Existing Customer 53 M 2 High School Single $120K + Blue 44 4 2 2 14734.00000 1634 13100.00000 0.98900 1289 23 0.91700 0.11100
65 788979258 Existing Customer 51 M 4 Uneducated Single $80K - $120K Silver 38 4 1 4 34516.00000 1515 33001.00000 0.59200 1293 32 0.60000 0.04400
66 807986133 Existing Customer 57 M 2 College Married $60K - $80K Blue 52 5 3 3 6584.00000 1817 4767.00000 0.62000 1353 35 0.66700 0.27600
67 788730933 Existing Customer 44 F 2 Uneducated Single Less than $40K Blue 20 6 3 3 2084.00000 1468 616.00000 1.00400 1132 28 0.55600 0.70400
68 711314058 Existing Customer 49 M 2 Graduate Married $60K - $80K Blue 32 2 2 2 1687.00000 1107 580.00000 1.71500 1670 17 2.40000 0.65600
69 717975333 Existing Customer 50 M 2 Doctorate Married $80K - $120K Blue 38 6 2 2 25300.00000 1330 23970.00000 1.07200 837 15 2.00000 0.05300
70 715971108 Existing Customer 51 M 4 Graduate Single $120K + Blue 42 3 2 3 34516.00000 1763 32753.00000 1.26600 1550 41 1.05000 0.05100
71 720096558 Existing Customer 55 F 2 Graduate Married Less than $40K Blue 42 5 3 3 2216.00000 1034 1182.00000 0.75800 1540 36 0.28600 0.46700
72 719580033 Existing Customer 54 M 1 Graduate NaN $60K - $80K Blue 43 4 2 3 2910.00000 2030 880.00000 0.76900 1256 21 0.40000 0.69800
73 820582308 Existing Customer 42 M 5 Uneducated Married $80K - $120K Blue 37 6 2 2 22913.00000 1528 21385.00000 0.41400 1394 35 0.52200 0.06700
74 789973308 Existing Customer 44 M 1 College Single $60K - $80K Blue 35 3 3 3 24312.00000 1932 22380.00000 1.31200 1341 24 1.18200 0.07900
75 712876233 Existing Customer 53 M 2 Graduate Single $80K - $120K Blue 36 5 3 2 5272.00000 1515 3757.00000 0.85700 1289 33 0.43500 0.28700
76 804595158 Existing Customer 44 F 4 Graduate Single Less than $40K Blue 36 6 4 2 7000.00000 2517 4483.00000 0.47500 1112 23 1.87500 0.36000
77 714826758 Existing Customer 37 F 3 Uneducated Single Less than $40K Blue 29 4 4 2 7038.00000 1801 5237.00000 0.75100 2339 57 0.96600 0.25600
78 779058108 Existing Customer 49 M 3 Graduate Divorced $60K - $80K Blue 30 6 1 2 2536.00000 1823 713.00000 0.70300 1468 23 0.35300 0.71900
79 710790258 Existing Customer 47 M 2 Graduate Married $80K - $120K Blue 38 6 3 2 28904.00000 1899 27005.00000 0.85000 1334 35 0.40000 0.06600
80 715623483 Existing Customer 47 M 3 Graduate Married $60K - $80K Blue 37 4 4 0 8567.00000 1695 6872.00000 1.24200 1457 41 1.41200 0.19800
81 715156383 Existing Customer 44 M 1 NaN NaN $120K + Blue 36 6 2 2 34516.00000 1533 32983.00000 0.92400 1603 29 0.52600 0.04400
82 711013983 Attrited Customer 55 F 4 NaN Married $40K - $60K Blue 45 2 4 3 2158.00000 0 2158.00000 0.58500 615 12 0.71400 0.00000
83 755420433 Existing Customer 59 F 1 Graduate Married abc Blue 52 2 3 3 10133.00000 1417 8716.00000 0.38300 1068 20 0.81800 0.14000
84 794543958 Existing Customer 53 M 1 Graduate Divorced $80K - $120K Blue 35 5 4 2 34516.00000 1219 33297.00000 1.12900 1590 27 2.00000 0.03500
85 716396358 Existing Customer 52 M 2 Graduate Married $60K - $80K Blue 47 5 3 0 3085.00000 1910 1175.00000 0.92100 1531 35 0.66700 0.61900
86 715398033 Existing Customer 53 M 2 High School Single $120K + Blue 35 4 2 1 19040.00000 2056 16984.00000 0.60700 1212 31 0.72200 0.10800
87 711743883 Existing Customer 43 F 3 Uneducated Single Less than $40K Blue 35 5 2 3 4026.00000 0 4026.00000 0.48300 1237 32 0.60000 0.00000
88 719720058 Existing Customer 44 M 3 High School Single $60K - $80K Blue 31 4 3 1 12756.00000 837 11919.00000 1.93200 1413 14 1.80000 0.06600
89 778992108 Existing Customer 57 M 2 NaN Married $120K + Blue 45 5 3 3 5266.00000 0 5266.00000 1.70200 1516 29 1.63600 0.00000
90 717539808 Existing Customer 51 F 2 High School Single Less than $40K Blue 36 3 2 2 9930.00000 0 9930.00000 0.73100 1276 21 1.33300 0.00000
91 714070758 Existing Customer 49 M 4 High School Single $80K - $120K Blue 38 4 3 0 31302.00000 1953 29349.00000 0.87500 1564 35 2.18200 0.06200
92 714107958 Existing Customer 45 M 1 Graduate Single $40K - $60K Blue 36 4 4 3 6576.00000 0 6576.00000 0.57900 1465 34 0.61900 0.00000
93 789140283 Existing Customer 53 M 0 Graduate Single $80K - $120K Blue 42 5 4 1 2664.00000 2037 627.00000 0.85000 1286 29 0.93300 0.76500
94 715550508 Existing Customer 45 F 3 NaN Married abc Blue 28 5 1 2 2535.00000 2440 95.00000 1.70500 1312 20 1.22200 0.96300
95 719712633 Existing Customer 64 M 1 Graduate Married Less than $40K Blue 52 6 4 3 1709.00000 895 814.00000 1.65600 1673 32 0.88200 0.52400
96 772629333 Existing Customer 45 M 3 Graduate Married $40K - $60K Blue 35 5 4 2 3454.00000 1200 2254.00000 0.59700 1313 30 0.30400 0.34700
97 720336708 Existing Customer 53 M 3 Doctorate Married $40K - $60K Blue 35 5 3 2 3789.00000 1706 2083.00000 1.04700 1609 42 0.68000 0.45000
98 802013583 Existing Customer 56 M 3 College Married $120K + Blue 50 3 2 0 9689.00000 2250 7439.00000 0.57600 1158 19 0.72700 0.23200
99 711887583 Attrited Customer 47 M 2 NaN Married $80K - $120K Blue 37 2 3 3 5449.00000 1628 3821.00000 0.69600 836 18 0.38500 0.29900
In [114]:
# View the last 100 rows of the dataset.
dfc.tail(100)
Out[114]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10027 720001758 Existing Customer 42 F 1 Graduate Single Less than $40K Blue 34 5 1 3 3534.00000 1175 2359.00000 0.73700 13588 94 0.80800 0.33200
10028 778428108 Existing Customer 36 F 1 Graduate Single Less than $40K Blue 16 6 5 3 6091.00000 1184 4907.00000 0.76600 17437 113 0.76600 0.19400
10029 715268508 Existing Customer 42 M 3 Graduate Married $40K - $60K Gold 36 3 2 5 23981.00000 1399 22582.00000 0.71200 14840 125 0.76100 0.05800
10030 788741733 Existing Customer 53 F 1 Graduate Married Less than $40K Blue 32 4 2 3 7832.00000 1784 6048.00000 0.71100 14614 106 0.65600 0.22800
10031 717355533 Existing Customer 56 M 3 Graduate Single $120K + Blue 44 3 1 2 34516.00000 1307 33209.00000 0.55800 13759 120 0.76500 0.03800
10032 712128258 Existing Customer 49 F 4 High School Single Less than $40K Silver 40 5 3 5 13704.00000 2127 11577.00000 0.80500 15872 116 0.68100 0.15500
10033 735825258 Existing Customer 39 M 2 High School Single Less than $40K Blue 32 4 1 4 4423.00000 1643 2780.00000 0.72400 15467 112 0.62300 0.37100
10034 714544908 Attrited Customer 56 M 4 Graduate Divorced $60K - $80K Blue 36 6 3 3 6224.00000 0 6224.00000 0.92000 8979 68 0.58100 0.00000
10035 818712633 Existing Customer 41 F 3 Graduate Single Less than $40K Blue 36 4 6 2 4133.00000 1067 3066.00000 0.63200 16100 119 0.60800 0.25800
10036 709831983 Existing Customer 57 M 2 Uneducated Single $120K + Blue 46 3 2 3 34516.00000 1244 33272.00000 0.60600 15176 126 0.80000 0.03600
10037 789398733 Existing Customer 56 M 3 Graduate Married $120K + Blue 46 3 3 2 5270.00000 779 4491.00000 0.73100 16179 105 0.69400 0.14800
10038 716850558 Existing Customer 48 M 4 Graduate NaN $60K - $80K Gold 32 5 1 3 34516.00000 1285 33231.00000 0.65400 15279 122 0.64900 0.03700
10039 717316008 Existing Customer 38 M 2 NaN Married $120K + Blue 36 6 2 3 3782.00000 0 3782.00000 0.71700 14977 117 0.72100 0.00000
10040 714677583 Attrited Customer 50 F 3 Doctorate Single abc Blue 36 4 3 3 5173.00000 0 5173.00000 0.91200 8757 68 0.78900 0.00000
10041 767348733 Existing Customer 56 M 2 NaN Married $60K - $80K Blue 49 3 3 3 4058.00000 793 3265.00000 0.75800 15865 105 0.66700 0.19500
10042 779900808 Existing Customer 32 M 0 College Single $60K - $80K Blue 16 2 3 2 21551.00000 0 21551.00000 0.53600 14950 94 0.84300 0.00000
10043 719900733 Existing Customer 60 M 1 High School Married $40K - $60K Blue 47 3 1 3 9805.00000 1007 8798.00000 0.80000 14641 95 0.90000 0.10300
10044 816517683 Existing Customer 38 F 2 Doctorate Married Less than $40K Blue 33 4 3 3 7530.00000 1336 6194.00000 0.74900 14719 115 0.85500 0.17700
10045 755305683 Existing Customer 45 M 5 Graduate Married $60K - $80K Blue 38 5 1 4 8983.00000 0 8983.00000 0.71300 15163 124 0.74600 0.00000
10046 715769958 Existing Customer 53 F 3 NaN Married $40K - $60K Blue 39 3 2 2 4855.00000 1912 2943.00000 0.60600 14969 104 0.76300 0.39400
10047 715493733 Attrited Customer 53 M 2 Graduate Married $40K - $60K Blue 36 2 3 4 5735.00000 0 5735.00000 1.00900 9088 64 0.68400 0.00000
10048 827440458 Attrited Customer 48 M 4 College Married $80K - $120K Blue 44 2 3 3 9524.00000 1449 8075.00000 0.55900 7527 58 0.65700 0.15200
10049 715713408 Existing Customer 55 M 2 Graduate Married Less than $40K Gold 46 3 5 3 15987.00000 0 15987.00000 0.82000 15853 117 0.56000 0.00000
10050 720128658 Attrited Customer 49 M 1 Uneducated Single $120K + Blue 30 5 3 3 5181.00000 0 5181.00000 0.83000 8943 68 1.00000 0.00000
10051 770038983 Attrited Customer 37 M 4 High School Single $60K - $80K Silver 31 1 3 3 28831.00000 2517 26314.00000 0.77000 8688 69 0.76900 0.08700
10052 714684108 Attrited Customer 31 M 1 Post-Graduate Single $40K - $60K Blue 24 4 2 4 4169.00000 534 3635.00000 0.83600 8565 70 0.79500 0.12800
10053 708942408 Existing Customer 46 F 3 Uneducated Single $40K - $60K Blue 36 3 2 2 7020.00000 1784 5236.00000 0.81500 16937 116 0.73100 0.25400
10054 779134458 Attrited Customer 33 F 1 Doctorate Single Less than $40K Blue 15 1 1 3 3709.00000 1180 2529.00000 0.64500 8130 74 0.89700 0.31800
10055 710066133 Attrited Customer 30 M 0 Graduate Single Less than $40K Silver 36 4 3 3 12833.00000 0 12833.00000 0.85300 9619 75 0.78600 0.00000
10056 709578258 Existing Customer 40 F 3 Uneducated Single Less than $40K Blue 33 4 3 2 3804.00000 1917 1887.00000 0.74400 14321 102 0.75900 0.50400
10057 718043583 Existing Customer 50 M 2 Uneducated Single $120K + Blue 40 5 3 4 21794.00000 1368 20426.00000 0.69100 15111 116 0.70600 0.06300
10058 715186608 Existing Customer 43 M 4 Graduate Married $80K - $120K Blue 35 3 2 3 3304.00000 2517 787.00000 0.74100 14565 120 0.81800 0.76200
10059 708447858 Existing Customer 59 F 2 Uneducated Married Less than $40K Blue 36 6 3 2 3850.00000 0 3850.00000 0.50900 15107 117 0.91800 0.00000
10060 718071933 Existing Customer 49 M 1 Graduate Married $40K - $60K Blue 36 4 2 3 4158.00000 1889 2269.00000 0.87200 15842 130 0.83100 0.45400
10061 771217758 Existing Customer 44 F 2 High School Single Less than $40K Blue 31 3 2 3 4182.00000 2028 2154.00000 0.65800 16920 114 0.60600 0.48500
10062 721210158 Existing Customer 29 M 2 College Married $40K - $60K Blue 17 3 1 3 4626.00000 1232 3394.00000 0.73100 14740 102 0.75900 0.26600
10063 788779983 Existing Customer 58 M 2 Post-Graduate Married $80K - $120K Blue 39 4 2 3 3742.00000 2517 1225.00000 0.75000 16401 121 0.80600 0.67300
10064 718614183 Attrited Customer 43 F 4 High School NaN Less than $40K Silver 31 6 2 2 13651.00000 0 13651.00000 1.04600 9772 71 0.77500 0.00000
10065 709865883 Attrited Customer 38 M 2 High School Divorced $60K - $80K Silver 25 1 2 3 26794.00000 821 25973.00000 0.84800 8788 58 0.93300 0.03100
10066 766273533 Existing Customer 53 M 2 Uneducated NaN $80K - $120K Blue 46 3 6 3 17190.00000 1736 15454.00000 0.69800 15055 117 0.69600 0.10100
10067 717390408 Attrited Customer 49 F 4 Uneducated Married $40K - $60K Blue 36 5 2 3 4167.00000 0 4167.00000 0.58100 7590 87 0.77600 0.00000
10068 718655808 Existing Customer 44 F 3 Uneducated Married Less than $40K Silver 36 3 3 2 12944.00000 1200 11744.00000 0.83900 16161 105 0.94400 0.09300
10069 817035108 Existing Customer 31 M 2 Graduate Single Less than $40K Blue 14 4 3 3 6933.00000 1870 5063.00000 0.65600 15585 115 0.91700 0.27000
10070 747551658 Existing Customer 47 M 3 High School NaN $80K - $120K Silver 40 5 3 2 34516.00000 1371 33145.00000 0.69100 15930 123 0.83600 0.04000
10071 713281833 Attrited Customer 37 M 3 NaN Single $40K - $60K Blue 29 6 2 3 13589.00000 2517 11072.00000 0.94100 9456 79 1.02600 0.18500
10072 771982083 Existing Customer 44 M 4 Doctorate Married $80K - $120K Blue 33 3 1 3 4552.00000 1144 3408.00000 0.70900 14418 104 0.67700 0.25100
10073 717642633 Existing Customer 51 M 2 Graduate Married $60K - $80K Blue 40 3 3 3 3750.00000 1801 1949.00000 0.88900 17995 116 0.65700 0.48000
10074 772721208 Existing Customer 50 M 2 High School Single $60K - $80K Blue 40 4 3 3 17655.00000 1720 15935.00000 0.70600 16293 120 0.73900 0.09700
10075 714385983 Existing Customer 57 F 2 Uneducated Single Less than $40K Blue 36 5 3 2 8224.00000 1239 6985.00000 0.79300 16319 124 0.65300 0.15100
10076 712516608 Existing Customer 43 M 3 Graduate Married $80K - $120K Blue 36 6 2 3 3946.00000 1275 2671.00000 0.64700 14822 96 0.81100 0.32300
10077 714725883 Existing Customer 41 F 4 Graduate Divorced $40K - $60K Blue 36 4 1 0 5654.00000 2470 3184.00000 0.63700 15123 111 0.88100 0.43700
10078 709398708 Existing Customer 40 M 3 High School Divorced $60K - $80K Blue 36 3 1 2 16268.00000 2517 13751.00000 0.65200 15925 124 0.82400 0.15500
10079 715090008 Existing Customer 45 M 4 Graduate Single $60K - $80K Silver 36 2 3 4 28564.00000 2517 26047.00000 0.71100 14875 107 0.78300 0.08800
10080 709159308 Existing Customer 42 F 4 High School Single Less than $40K Blue 30 3 1 3 6172.00000 1212 4960.00000 0.72300 16622 99 0.65000 0.19600
10081 711348033 Existing Customer 50 M 2 Graduate Married $40K - $60K Blue 40 4 3 2 5435.00000 1243 4192.00000 0.74900 16202 117 0.67100 0.22900
10082 788230158 Existing Customer 29 M 1 NaN Married $40K - $60K Silver 20 3 1 2 16496.00000 1409 15087.00000 0.67700 16715 126 0.70300 0.08500
10083 718170033 Existing Customer 42 F 4 Uneducated Married abc Blue 23 4 1 2 8348.00000 0 8348.00000 0.69500 15905 111 0.70800 0.00000
10084 718161708 Existing Customer 46 F 4 Graduate Married Less than $40K Blue 34 5 1 2 7562.00000 1500 6062.00000 0.63200 14908 100 0.75400 0.19800
10085 717714633 Existing Customer 49 M 3 Uneducated Married $120K + Gold 38 4 3 4 14938.00000 0 14938.00000 0.73700 15277 131 0.72400 0.00000
10086 824150733 Existing Customer 43 M 2 High School Married $80K - $120K Silver 39 3 1 3 34516.00000 1987 32529.00000 0.67300 16177 122 0.84800 0.05800
10087 713768358 Existing Customer 45 M 4 Graduate Single $40K - $60K Blue 35 3 2 2 7935.00000 888 7047.00000 0.77900 15380 122 0.69400 0.11200
10088 709188108 Existing Customer 45 M 2 Graduate Single $60K - $80K Silver 33 4 2 2 34516.00000 1529 32987.00000 0.60900 13940 105 0.81000 0.04400
10089 720328983 Attrited Customer 52 F 5 NaN Married Less than $40K Blue 36 4 3 3 9611.00000 0 9611.00000 0.84000 7636 64 0.82900 0.00000
10090 712296408 Existing Customer 36 F 3 NaN Married $40K - $60K Blue 22 5 3 3 12958.00000 2273 10685.00000 0.60800 15681 96 0.62700 0.17500
10091 719106783 Existing Customer 52 M 1 High School Single $80K - $120K Blue 41 4 1 4 4103.00000 1972 2131.00000 0.66500 16344 118 0.78800 0.48100
10092 713437008 Attrited Customer 40 F 3 Graduate Married abc Blue 25 1 2 3 6888.00000 1878 5010.00000 1.05900 9038 64 0.82900 0.27300
10093 719574033 Existing Customer 38 M 1 High School Single $60K - $80K Blue 28 2 2 2 21906.00000 0 21906.00000 0.69600 15349 119 0.70000 0.00000
10094 716123283 Existing Customer 59 M 1 NaN Single $60K - $80K Blue 48 3 1 2 7288.00000 0 7288.00000 0.64000 14873 120 0.71400 0.00000
10095 789023508 Existing Customer 46 M 3 NaN Married $80K - $120K Blue 33 4 1 3 34516.00000 1099 33417.00000 0.81600 15490 110 0.61800 0.03200
10096 710005683 Existing Customer 56 M 2 College Single $80K - $120K Blue 46 3 3 5 12540.00000 1696 10844.00000 0.79900 16518 115 0.71600 0.13500
10097 801036033 Existing Customer 31 M 0 High School Single $40K - $60K Blue 25 3 2 3 4493.00000 1388 3105.00000 0.79500 17744 104 0.76300 0.30900
10098 716644008 Attrited Customer 55 M 3 Graduate Single $120K + Silver 36 4 3 4 34516.00000 0 34516.00000 1.00700 9931 70 0.75000 0.00000
10099 709094358 Existing Customer 51 F 1 Graduate Single Less than $40K Blue 41 4 2 3 8900.00000 798 8102.00000 0.64700 16737 88 0.66000 0.09000
10100 719848008 Existing Customer 39 M 2 Graduate NaN $60K - $80K Silver 36 4 2 2 29808.00000 0 29808.00000 0.66900 16098 128 0.68400 0.00000
10101 718372458 Existing Customer 42 M 2 Graduate NaN $40K - $60K Blue 30 3 2 5 3735.00000 1723 2012.00000 0.59500 14501 92 0.84000 0.46100
10102 720608658 Existing Customer 33 F 1 Uneducated Single Less than $40K Blue 36 5 3 3 8398.00000 1875 6523.00000 0.72700 16706 123 0.75700 0.22300
10103 717185658 Attrited Customer 51 M 1 High School Married $80K - $120K Blue 36 4 3 4 22754.00000 0 22754.00000 0.79900 8531 77 0.79100 0.00000
10104 715474083 Existing Customer 51 M 3 Graduate Single $60K - $80K Silver 36 3 2 2 29663.00000 1743 27920.00000 0.66700 14638 93 0.72200 0.05900
10105 709646433 Attrited Customer 59 F 1 High School Married Less than $40K Blue 50 1 4 3 5043.00000 743 4300.00000 0.80500 10170 66 0.78400 0.14700
10106 717494358 Existing Customer 58 F 0 Graduate Single Less than $40K Blue 48 2 2 5 4299.00000 1334 2965.00000 0.66000 15068 123 0.68500 0.31000
10107 713924283 Attrited Customer 61 M 0 Graduate Single $60K - $80K Blue 54 2 1 4 11859.00000 1644 10215.00000 0.86600 8930 79 0.83700 0.13900
10108 714471183 Attrited Customer 47 M 4 Graduate Divorced $80K - $120K Blue 39 4 3 4 17504.00000 476 17028.00000 0.89200 10468 66 0.73700 0.02700
10109 780613758 Existing Customer 47 M 5 High School Single Less than $40K Blue 35 4 3 5 4165.00000 0 4165.00000 0.81300 17093 111 0.82000 0.00000
10110 718225683 Existing Customer 56 M 1 Graduate Single $80K - $120K Silver 49 5 2 2 34516.00000 1091 33425.00000 0.64000 15274 108 0.71400 0.03200
10111 710734308 Existing Customer 49 M 1 Graduate Single $60K - $80K Blue 40 6 3 3 6481.00000 1569 4912.00000 0.69200 15937 119 0.80300 0.24200
10112 708564858 Attrited Customer 33 M 2 College Married $120K + Gold 20 2 1 4 34516.00000 0 34516.00000 1.00400 9338 73 0.62200 0.00000
10113 713733633 Attrited Customer 27 M 0 High School Divorced $60K - $80K Blue 36 2 3 2 13303.00000 2517 10786.00000 0.92900 10219 85 0.80900 0.18900
10114 720841533 Existing Customer 29 M 0 Graduate Married Less than $40K Blue 15 3 1 5 4700.00000 0 4700.00000 0.61700 14723 96 0.65500 0.00000
10115 712210458 Existing Customer 38 M 1 Uneducated Single $40K - $60K Blue 36 2 3 2 5639.00000 1558 4081.00000 0.61400 16628 109 0.94600 0.27600
10116 714109308 Existing Customer 46 M 5 College Single $80K - $120K Blue 36 1 2 3 13187.00000 2241 10946.00000 0.68900 15354 112 0.93100 0.17000
10117 712503408 Existing Customer 57 M 2 Graduate Married $80K - $120K Blue 40 6 3 4 17925.00000 1909 16016.00000 0.71200 17498 111 0.82000 0.10600
10118 713755458 Attrited Customer 50 M 1 NaN NaN $80K - $120K Blue 36 6 3 4 9959.00000 952 9007.00000 0.82500 10310 63 1.10000 0.09600
10119 716893683 Attrited Customer 55 F 3 Uneducated Single abc Blue 47 4 3 3 14657.00000 2517 12140.00000 0.16600 6009 53 0.51400 0.17200
10120 710841183 Existing Customer 54 M 1 High School Single $60K - $80K Blue 34 5 2 0 13940.00000 2109 11831.00000 0.66000 15577 114 0.75400 0.15100
10121 713899383 Existing Customer 56 F 1 Graduate Single Less than $40K Blue 50 4 1 4 3688.00000 606 3082.00000 0.57000 14596 120 0.79100 0.16400
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.00000 1851 2152.00000 0.70300 15476 117 0.85700 0.46200
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.00000 2186 2091.00000 0.80400 8764 69 0.68300 0.51100
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.00000 0 5409.00000 0.81900 10291 60 0.81800 0.00000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.00000 0 5281.00000 0.53500 8395 62 0.72200 0.00000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.00000 1961 8427.00000 0.70300 10294 61 0.64900 0.18900
In [115]:
# Understand the dataset: size and feature names
print ("Rows     : " , dfc.shape[0])  # number of rows/observations
print ("Columns  : " , dfc.shape[1])  # number of columns
print ("#"*40,"\n","Features : \n\n", dfc.columns.tolist())  # column names/features
missing_df = pd.DataFrame({
    "Missing": dfc.isnull().sum(),
    "Missing %": round(dfc.isnull().sum() / len(dfc) * 100, 2)
})
display(missing_df.sort_values(by='Missing', ascending=False))
Rows     :  10127
Columns  :  21
######################################## 
 Features : 

 ['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
Missing Missing %
Education_Level 1519 15.00000
Marital_Status 749 7.40000
CLIENTNUM 0 0.00000
Contacts_Count_12_mon 0 0.00000
Total_Ct_Chng_Q4_Q1 0 0.00000
Total_Trans_Ct 0 0.00000
Total_Trans_Amt 0 0.00000
Total_Amt_Chng_Q4_Q1 0 0.00000
Avg_Open_To_Buy 0 0.00000
Total_Revolving_Bal 0 0.00000
Credit_Limit 0 0.00000
Total_Relationship_Count 0 0.00000
Months_Inactive_12_mon 0 0.00000
Attrition_Flag 0 0.00000
Months_on_book 0 0.00000
Card_Category 0 0.00000
Income_Category 0 0.00000
Dependent_count 0 0.00000
Gender 0 0.00000
Customer_Age 0 0.00000
Avg_Utilization_Ratio 0 0.00000
In [116]:
#### Check the data types of the columns for the dataset.
dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

Observations

  • Customer_Age, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, and Avg_Utilization_Ratio are continuous variables; the rest are categorical.
  • Attrition_Flag is the target variable.
  • Education_Level (~15%) and Marital_Status (~7.4%) have missing values; all other columns are complete.

Summary of the dataset.¶

In [117]:
dfc.describe().T
Out[117]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.00000 739177606.33366 36903783.45023 708082083.00000 713036770.50000 717926358.00000 773143533.00000 828343083.00000
Customer_Age 10127.00000 46.32596 8.01681 26.00000 41.00000 46.00000 52.00000 73.00000
Dependent_count 10127.00000 2.34620 1.29891 0.00000 1.00000 2.00000 3.00000 5.00000
Months_on_book 10127.00000 35.92841 7.98642 13.00000 31.00000 36.00000 40.00000 56.00000
Total_Relationship_Count 10127.00000 3.81258 1.55441 1.00000 3.00000 4.00000 5.00000 6.00000
Months_Inactive_12_mon 10127.00000 2.34117 1.01062 0.00000 2.00000 2.00000 3.00000 6.00000
Contacts_Count_12_mon 10127.00000 2.45532 1.10623 0.00000 2.00000 2.00000 3.00000 6.00000
Credit_Limit 10127.00000 8631.95370 9088.77665 1438.30000 2555.00000 4549.00000 11067.50000 34516.00000
Total_Revolving_Bal 10127.00000 1162.81406 814.98734 0.00000 359.00000 1276.00000 1784.00000 2517.00000
Avg_Open_To_Buy 10127.00000 7469.13964 9090.68532 3.00000 1324.50000 3474.00000 9859.00000 34516.00000
Total_Amt_Chng_Q4_Q1 10127.00000 0.75994 0.21921 0.00000 0.63100 0.73600 0.85900 3.39700
Total_Trans_Amt 10127.00000 4404.08630 3397.12925 510.00000 2155.50000 3899.00000 4741.00000 18484.00000
Total_Trans_Ct 10127.00000 64.85869 23.47257 10.00000 45.00000 67.00000 81.00000 139.00000
Total_Ct_Chng_Q4_Q1 10127.00000 0.71222 0.23809 0.00000 0.58200 0.70200 0.81800 3.71400
Avg_Utilization_Ratio 10127.00000 0.27489 0.27569 0.00000 0.02300 0.17600 0.50300 0.99900

Observations

  • The average customer age is ~46 and the maximum is 73.
  • The average length of the customer relationship with the bank is ~36 months, with a minimum of 13 and a maximum of 56.
  • The maximum number of products held by a customer is 6; the average is ~4.
  • Mean Credit_Limit is 8631 while the median is 4549, indicating outliers and a right-skewed distribution.
  • Total_Revolving_Bal has a mean of 1162.
  • Avg_Open_To_Buy has a mean of 7469 and a max of 34516. The same maximum appeared in Credit_Limit, so 34516 looks like a default/cap value. The distribution is right skewed with some outliers on the higher end.
  • Total_Amt_Chng_Q4_Q1 has a median of 0.736 and a mean of 0.760.
  • Total_Trans_Amt has a mean of 4404 and a median of 3899, indicating a right-skewed distribution with outliers on the higher end.
  • Total_Trans_Ct has a mean of ~64.9 and a median of 67; the mean below the median indicates a slight left skew.
  • Total_Ct_Chng_Q4_Q1 has a mean of 0.71 and a median of 0.702.
  • Avg_Utilization_Ratio is right skewed, with a mean of 0.27 and a median of 0.176.
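The mean-versus-median gaps noted above can be confirmed numerically with pandas' sample skewness. A minimal sketch, using hypothetical toy values (not the real dfc) that mimic the capped, right-skewed Credit_Limit:

```python
import pandas as pd

def skew_report(df: pd.DataFrame, cols) -> pd.Series:
    """Sample skewness per column, sorted descending; values > 1 suggest strong right skew."""
    return df[cols].skew().sort_values(ascending=False)

# Toy frame with hypothetical values; one large cap-like value drives the skew
toy = pd.DataFrame({"Credit_Limit": [1500.0, 2000.0, 2500.0, 3000.0, 34516.0],
                    "Customer_Age": [41.0, 46.0, 46.0, 52.0, 46.0]})
print(skew_report(toy, ["Credit_Limit", "Customer_Age"]))
```

On the real data, `skew_report(dfc, dfc.select_dtypes('number').columns)` would rank the monetary columns above the roughly symmetric Customer_Age.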

Preprocessing¶

Dropping Variables¶

In [118]:
dfc.drop(['CLIENTNUM'],axis=1,inplace=True)
In [119]:
cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']
In [120]:
for col in cat_cols:
    print(f"Feature: {col}")
    print("-"*40)
    display(pd.DataFrame({"Counts": dfc[col].value_counts(dropna=False)}).sort_values(by='Counts', ascending=False))
Feature: Attrition_Flag
----------------------------------------
Counts
Existing Customer 8500
Attrited Customer 1627
Feature: Gender
----------------------------------------
Counts
F 5358
M 4769
Feature: Education_Level
----------------------------------------
Counts
Graduate 3128
High School 2013
NaN 1519
Uneducated 1487
College 1013
Post-Graduate 516
Doctorate 451
Feature: Marital_Status
----------------------------------------
Counts
Married 4687
Single 3943
NaN 749
Divorced 748
Feature: Income_Category
----------------------------------------
Counts
Less than $40K 3561
$40K - $60K 1790
$80K - $120K 1535
$60K - $80K 1402
abc 1112
$120K + 727
Feature: Card_Category
----------------------------------------
Counts
Blue 9436
Silver 555
Gold 116
Platinum 20
Feature: Dependent_count
----------------------------------------
Counts
3 2732
2 2655
1 1838
4 1574
0 904
5 424
Feature: Total_Relationship_Count
----------------------------------------
Counts
3 2305
4 1912
5 1891
6 1866
2 1243
1 910
Feature: Months_Inactive_12_mon
----------------------------------------
Counts
3 3846
2 3282
1 2233
4 435
5 178
6 124
0 29
Feature: Contacts_Count_12_mon
----------------------------------------
Counts
3 3380
2 3227
1 1499
4 1392
0 399
5 176
6 54

Observations

  • 1627 customers have attrited.
  • Education_Level and Marital_Status contain NaN values, and Income_Category contains the placeholder label 'abc'; these should be treated as missing values and imputed.
  • The Blue card has the most customers.
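One way to handle the 'abc' placeholder is to map it to NaN so that Income_Category can be counted and imputed together with the NaNs in Education_Level and Marital_Status. A sketch; the 'abc' label and column name follow the output above, but the toy rows are hypothetical:

```python
import numpy as np
import pandas as pd

def mark_unknown_as_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the 'abc' placeholder in Income_Category with NaN so it is
    treated like the missing values in Education_Level and Marital_Status."""
    out = df.copy()
    out["Income_Category"] = out["Income_Category"].replace("abc", np.nan)
    return out

# Hypothetical rows illustrating the cleanup
toy = pd.DataFrame({"Income_Category": ["$40K - $60K", "abc", "Less than $40K", "abc"]})
print(mark_unknown_as_missing(toy)["Income_Category"].isna().sum())  # 2 placeholders flagged
```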
In [121]:
## Converting the data type of categorical features to 'category'

cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']

dfc[cat_cols] = dfc[cat_cols].astype('category')
dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64   
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64   
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64 
 13  Total_Revolving_Bal       10127 non-null  int64   
 14  Avg_Open_To_Buy           10127 non-null  float64 
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64 
 16  Total_Trans_Amt           10127 non-null  int64   
 17  Total_Trans_Ct            10127 non-null  int64   
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64 
 19  Avg_Utilization_Ratio     10127 non-null  float64 
dtypes: category(10), float64(5), int64(5)
memory usage: 892.3 KB
In [122]:
dfc.describe(include=['category']).T
Out[122]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Dependent_count 10127 6 3 2732
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436
Total_Relationship_Count 10127 6 3 2305
Months_Inactive_12_mon 10127 7 3 3846
Contacts_Count_12_mon 10127 7 3 3380

Age¶

Age can be a vital factor in attrition; converting ages to bins to explore whether there is any pattern.

In [123]:
dfc.Customer_Age.describe()
Out[123]:
count   10127.00000
mean       46.32596
std         8.01681
min        26.00000
25%        41.00000
50%        46.00000
75%        52.00000
max        73.00000
Name: Customer_Age, dtype: float64
In [124]:
dfc['Agebin'] = pd.cut(dfc['Customer_Age'], bins = [25, 35,45,55,65, 75], labels = ['25-35', '36-45', '46-55', '56-65','66-75'])
In [125]:
dfc.Agebin.value_counts()
Out[125]:
46-55    4135
36-45    3742
56-65    1321
25-35     919
66-75      10
Name: Agebin, dtype: int64

EDA¶

In [126]:
def dist_box(data):
    # Plots a combined boxplot + histogram for univariate analysis of a
    # continuous variable: spread, central tendency, dispersion, and outliers.
    name = data.name.upper()
    fig, (ax_box, ax_dis) = plt.subplots(nrows=2, sharex=True,
                                         gridspec_kw={"height_ratios": (.25, .75)},
                                         figsize=(8, 5))
    mean = data.mean()
    median = data.median()
    sns.set_theme(style="white")  # white background for better visualisation
    fig.suptitle("SPREAD OF DATA FOR " + name, fontsize=18, fontweight='bold')
    sns.boxplot(x=data, showmeans=True, orient='h', color="tan", ax=ax_box)
    ax_box.set(xlabel='')
    sns.despine(top=True, right=True, left=True)  # remove spines from graph
    sns.histplot(data, color='red', ax=ax_dis)  # distplot is deprecated
    ax_dis.axvline(mean, color='r', linestyle='--', linewidth=2, label=f'Mean: {mean:.2f}')
    ax_dis.axvline(median, color='g', linestyle='-', linewidth=2, label=f'Median: {median:.2f}')
    ax_dis.legend()
                    
In [127]:
# Check the spread of all quantitative columns
for col in dfc.select_dtypes(include='number').columns:
    dist_box(dfc[col])

Observations

  • Customer_Age is almost normally distributed, with some outliers on the higher end.
  • Months_on_book peaks around 35 to 36 and has many outliers on both ends; most customers have held a credit card for 35 to 36 months.
  • Credit_Limit is right skewed, with a sudden peak near 34516; as seen before, this maximum appears to be a default/cap value. There are many outliers on the higher end, and customers above 25000 need to be investigated further.
  • Total_Revolving_Bal has an unusual distribution: many customers sit near 0, the middle is roughly normal, and there is a sudden peak near 2500.
  • Avg_Open_To_Buy has the same distribution as Credit_Limit.
  • Total_Amt_Chng_Q4_Q1 has many outliers on both ends; a few customers show a Q4-to-Q1 amount-change ratio near 3.5 and need to be investigated further.
  • Total_Trans_Amt is multimodal, with clusters around 0-2500, 2500-5000, 7500-10000, and 12500-17500, and many outliers on the higher end.
  • Total_Trans_Ct is also trimodal, with outliers on the higher end.
  • Total_Ct_Chng_Q4_Q1 is roughly normal, with many outliers on both ends.
  • Avg_Utilization_Ratio measures how much of the given credit limit the customer actually uses; it ranges from 0.0 to 1.0.
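The suspected cap can be quantified by counting how many customers sit exactly at the maximum limit. A sketch with hypothetical limits; on the real data the same check would run on `dfc['Credit_Limit']`:

```python
import pandas as pd

def share_at_cap(limits: pd.Series) -> float:
    """Fraction of customers whose credit limit equals the series maximum,
    i.e. the suspected default/cap value (34516 in this dataset)."""
    return float((limits == limits.max()).mean())

toy = pd.Series([2000.0, 34516.0, 5000.0, 34516.0])  # hypothetical limits
print(share_at_cap(toy))  # 0.5
```

A large share at the cap would support treating 34516 as an administrative default rather than a genuinely assigned limit.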
In [128]:
# Count plots with percentage annotations for all categorical variables
plt.figure(figsize=(15, 20))
sns.set_theme(style="white")
sns.set_palette('twilight_shifted')
for i, variable in enumerate(cat_cols):
    plt.subplot(9, 2, i + 1)
    order = dfc[variable].value_counts(ascending=False).index
    ax = sns.countplot(x=variable, data=dfc, order=order)
    sns.despine(top=True, right=True, left=True)  # remove spines from graph
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / len(dfc))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plt.annotate(percentage, (x, y), ha='center')
    plt.tight_layout()
    plt.title(variable.upper())
                                     

Observations

  • ~16% of credit card customers attrited.
  • ~52% of credit card customers are female.
  • ~30% of customers are graduates; very few are postgraduates or doctorates.
  • ~46% of customers are married; the 7.4% with unknown marital status needs to be imputed.
  • ~35% earn less than $40K.
  • ~93% hold a Blue card; very few customers hold a Platinum card.
  • ~23% of customers hold 3 bank products, the most common count.
  • ~38% have been inactive for 3 of the last 12 months; customers inactive for 4-6 months should be investigated further for any relationship with attrition.
  • ~60% were contacted 2-3 times in the last 12 months.
In [129]:
sns.set_palette(sns.color_palette("Set2", 8))
plt.figure(figsize=(15,12))
sns.heatmap(dfc.corr(numeric_only=True), annot=True)
plt.show()
In [130]:
sns.set_palette(sns.color_palette("Set1", 8))
sns.pairplot(dfc, hue="Attrition_Flag",corner=True)
plt.show()

Observations

  • Customer_Age and Months_on_book are highly correlated.
  • Credit_Limit and Avg_Utilization_Ratio have some negative correlation.
  • Total_Revolving_Bal and Avg_Utilization_Ratio are positively correlated.
  • Avg_Open_To_Buy is negatively correlated with Avg_Utilization_Ratio.
  • There is very little correlation between total transaction amount and credit limit.
  • As expected, total transaction amount and total transaction count are highly correlated.
  • Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated; we can drop one of them.
  • It is also logical that Total_Trans_Amt is correlated with Total_Amt_Chng_Q4_Q1 and Total_Ct_Chng_Q4_Q1, since these features appear to be derived from the transaction totals; we may be able to drop one of these columns.
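The near-perfect Credit_Limit / Avg_Open_To_Buy pair can be pruned programmatically. A sketch under assumptions: the 0.95 threshold and the toy values are illustrative, not from the notebook.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95):
    """Drop the later column of any numeric pair whose absolute Pearson
    correlation exceeds `threshold`; returns the reduced frame and the drop list."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "Credit_Limit":    [1000.0, 2000.0, 3000.0, 4000.0],
    "Avg_Open_To_Buy": [900.0, 1900.0, 2900.0, 3900.0],  # limit minus a constant
    "Total_Trans_Ct":  [40, 70, 55, 90],
})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)  # ['Avg_Open_To_Buy']
```

Dropping the later column of each offending pair preserves Credit_Limit, which is the more directly interpretable feature.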
In [131]:
### Function to plot distributions and boxplots of a feature split by attrition
def plot(x, target='Attrition_Flag'):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    axs[0, 0].set_title(f'Distribution of {x}\nfor existing customers', fontsize=12, fontweight='bold')
    sns.histplot(dfc[dfc[target] == 'Existing Customer'][x], ax=axs[0, 0], color='teal')  # distplot is deprecated
    axs[0, 1].set_title(f'Distribution of {x}\nfor attrited customers', fontsize=12, fontweight='bold')
    sns.histplot(dfc[dfc[target] == 'Attrited Customer'][x], ax=axs[0, 1], color='orange')
    axs[1, 0].set_title(f'Boxplot of {x} w.r.t. attrition', fontsize=12, fontweight='bold')

    line = plt.Line2D((.1, .9), (.5, .5), color='grey', linewidth=1.5, linestyle='--')
    fig.add_artist(line)

    # Pass x/y by keyword: positional data arguments to boxplot are deprecated
    sns.boxplot(x=dfc[target], y=dfc[x], ax=axs[1, 0], palette='gist_rainbow', showmeans=True)
    axs[1, 1].set_title(f'Boxplot of {x} w.r.t. attrition - without outliers', fontsize=12, fontweight='bold')
    sns.boxplot(x=dfc[target], y=dfc[x], ax=axs[1, 1], showfliers=False, palette='gist_rainbow', showmeans=True)
    sns.despine(top=True, right=True, left=True)  # remove spines from graph
    plt.tight_layout(pad=4)
    plt.show()
In [132]:
# Plot the spread of all quantitative columns against the target
list_col = dfc.select_dtypes(include='number').columns.to_list()
for col in list_col:
    plot(col)
   

Observation

  • Age, Months_on_book, Credit_Limit, and Avg_Open_To_Buy show no clear difference between attrited and existing customers.
  • There does not seem to be any relationship between age and attrition.
  • Existing customers have a higher total revolving balance than customers who attrited.
  • Customers with lower transaction amounts and a low Q4-to-Q1 change in transaction amount were more likely to attrite.
  • Customers with a low number of transactions and a low Q4-to-Q1 change in transaction count attrited.
  • On average, customers with lower utilization attrited.
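These behavioral gaps can be summarized in one table by grouping on the target. A sketch with hypothetical rows; the column names follow the notebook's dfc:

```python
import pandas as pd

toy = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Existing Customer",
                       "Attrited Customer", "Attrited Customer"],
    "Total_Trans_Ct":      [80, 70, 45, 40],
    "Total_Revolving_Bal": [1500, 1300, 200, 0],
})
# Median behavior per segment: attrited customers transact and revolve less
profile = toy.groupby("Attrition_Flag")[["Total_Trans_Ct", "Total_Revolving_Bal"]].median()
print(profile)
```

On the real data, `dfc.groupby("Attrition_Flag")[num_cols].median()` gives the same per-segment view for every numeric feature at once.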
In [133]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 8))

sns.barplot(y='Credit_Limit',x='Income_Category',hue='Attrition_Flag',data=dfc)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Income Category vs Credit Limit')
Out[133]:
Text(0.5, 1.0, 'Income Category vs Credit Limit')
In [134]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(y='Credit_Limit',x='Education_Level',hue='Attrition_Flag',data=dfc)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit Limit vs Education Level')
Out[134]:
Text(0.5, 1.0, 'Credit Limit vs Education Level')
In [135]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Credit_Limit',hue='Attrition_Flag',data=dfc)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Customer Age vs Credit Limit')
Out[135]:
Text(0.5, 1.0, 'Customer Age vs Credit Limit')
In [136]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Total_Revolving_Bal',hue='Attrition_Flag',data=dfc)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Customer Age vs Total Revolving Balance')
Out[136]:
Text(0.5, 1.0, 'Customer Age vs Total Revolving Balance')
In [137]:
plt.figure(figsize=(10,5)) 
sns.set_palette(sns.color_palette("tab20", 9))
sns.barplot(x='Agebin',y='Total_Trans_Amt',hue='Attrition_Flag',data=dfc)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Customer Age vs Total Transaction Amount')
Out[137]:
Text(0.5, 1.0, 'Customer Age vs Total Transaction Amount')
In [138]:
plt.figure(figsize=(10,5)) 
sns.barplot(y='Credit_Limit',x='Gender',hue='Attrition_Flag',data=dfc) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit limit  vs Gender')
Out[138]:
Text(0.5, 1.0, 'Credit limit  vs Gender')
In [139]:
plt.figure(figsize=(10,5))
sns.barplot(y='Credit_Limit',x='Card_Category',hue='Attrition_Flag',data=dfc) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Credit Limit  vs Card Category')
Out[139]:
Text(0.5, 1.0, 'Credit Limit  vs Card Category')
In [140]:
plt.figure(figsize=(10,5))
sns.barplot(y='Total_Trans_Amt',x='Card_Category',hue='Attrition_Flag',data=dfc) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Total Transaction Amount vs Card Category')
Out[140]:
Text(0.5, 1.0, 'Total Transaction Amount vs Card Category')
In [141]:
plt.figure(figsize=(10,5))
sns.barplot(y='Total_Trans_Ct',x='Card_Category',hue='Attrition_Flag',data=dfc) 
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Total Transaction Count vs Card Category')
Out[141]:
Text(0.5, 1.0, 'Total Transaction Count vs Card Category')

Observation

  • Customers aged 36-55 were given higher credit limits.
  • Platinum card holders had higher credit limits.
  • Customers earning more than $120K had higher credit limits.
  • Male customers were given higher credit limits than female customers.
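The bar plots above aggregate mean Credit_Limit per group; the same numbers can be read off directly with a pivot table. A sketch on hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Income_Category": ["$120K +", "$120K +", "Less than $40K", "Less than $40K"],
    "Attrition_Flag":  ["Existing Customer", "Attrited Customer",
                        "Existing Customer", "Attrited Customer"],
    "Credit_Limit":    [30000.0, 26000.0, 4000.0, 2000.0],
})
# Mean credit limit per income band, split by attrition status
pivot = toy.pivot_table(values="Credit_Limit", index="Income_Category",
                        columns="Attrition_Flag", aggfunc="mean")
print(pivot)
```

On the real data, `dfc.pivot_table(values="Credit_Limit", index="Income_Category", columns="Attrition_Flag", aggfunc="mean")` reproduces the bar heights exactly.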
In [142]:
## Function to display a crosstab and plot a stacked bar chart per category
def stacked_plot(x):
    sns.set_palette(sns.color_palette("tab20", 8))
    tab1 = pd.crosstab(x, dfc['Attrition_Flag'], margins=True)
    display(tab1)
    tab = pd.crosstab(x, dfc['Attrition_Flag'], normalize='index')
    tab.plot(kind='bar', stacked=True, figsize=(9, 5))
    plt.xticks(rotation=0)
    plt.legend(loc="upper left", title=" ", bbox_to_anchor=(1, 1))
    sns.despine(top=True, right=True, left=True)  # remove spines from graph
    plt.show()
In [143]:
cat_cols.append("Agebin")
for i, variable in enumerate(cat_cols):
       stacked_plot(dfc[variable])
Attrition_Flag Attrited Customer Existing Customer All
Attrition_Flag
Attrited Customer 1627 0 1627
Existing Customer 0 8500 8500
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Gender
F 930 4428 5358
M 697 4072 4769
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Education_Level
College 154 859 1013
Doctorate 95 356 451
Graduate 487 2641 3128
High School 306 1707 2013
Post-Graduate 92 424 516
Uneducated 237 1250 1487
All 1371 7237 8608
Attrition_Flag Attrited Customer Existing Customer All
Marital_Status
Divorced 121 627 748
Married 709 3978 4687
Single 668 3275 3943
All 1498 7880 9378
Attrition_Flag Attrited Customer Existing Customer All
Income_Category
$120K + 126 601 727
$40K - $60K 271 1519 1790
$60K - $80K 189 1213 1402
$80K - $120K 242 1293 1535
Less than $40K 612 2949 3561
abc 187 925 1112
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Card_Category
Blue 1519 7917 9436
Gold 21 95 116
Platinum 5 15 20
Silver 82 473 555
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Dependent_count
0 135 769 904
1 269 1569 1838
2 417 2238 2655
3 482 2250 2732
4 260 1314 1574
5 64 360 424
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Total_Relationship_Count
1 233 677 910
2 346 897 1243
3 400 1905 2305
4 225 1687 1912
5 227 1664 1891
6 196 1670 1866
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Months_Inactive_12_mon
0 15 14 29
1 100 2133 2233
2 505 2777 3282
3 826 3020 3846
4 130 305 435
5 32 146 178
6 19 105 124
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Contacts_Count_12_mon
0 7 392 399
1 108 1391 1499
2 403 2824 3227
3 681 2699 3380
4 315 1077 1392
5 59 117 176
6 54 0 54
All 1627 8500 10127
Attrition_Flag Attrited Customer Existing Customer All
Agebin
25-35 122 797 919
36-45 606 3136 3742
46-55 688 3447 4135
56-65 209 1112 1321
66-75 2 8 10
All 1627 8500 10127

Observations

  • Female customers attrited slightly more than their male counterparts.
  • Highly educated customers (Doctorate and Post-Graduate) attrited at the highest rates.
  • Single customers attrited more.
  • Customers earning more than $120K or less than $40K attrited more.
  • Platinum card holders attrited most, but with only 20 samples this is inconclusive; Gold card holders attrited more than Blue and Silver.
  • Customers with 3 dependents attrited most.
  • Customers holding only 1 or 2 bank products attrited more than customers with more products.
  • Customers who were never inactive attrited at the highest rate, but with only 29 samples we can't be sure; otherwise, customers inactive for 4 months attrited most, followed by 3 and 5 months.
  • Customers aged 66-75 attrited most, but with only 10 samples this is inconclusive; customers aged 36-55 are also more likely to attrite.
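The stacked bars come from row-normalized crosstabs, so the attrition rate for any segment can be extracted directly. A sketch on hypothetical rows; the column and label names follow dfc:

```python
import pandas as pd

def attrition_rate(df: pd.DataFrame, col: str,
                   target: str = "Attrition_Flag",
                   positive: str = "Attrited Customer") -> pd.Series:
    """Share of attrited customers within each level of `col`."""
    return pd.crosstab(df[col], df[target], normalize="index")[positive]

toy = pd.DataFrame({
    "Gender":         ["F", "F", "F", "M", "M"],
    "Attrition_Flag": ["Attrited Customer", "Existing Customer", "Existing Customer",
                       "Attrited Customer", "Existing Customer"],
})
print(attrition_rate(toy, "Gender"))  # F: 1/3, M: 1/2
```

Calling `attrition_rate(dfc, col).sort_values(ascending=False)` for each categorical column would rank segments by churn risk and back the bullet points above with exact percentages.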
In [144]:
#Profile of Attrited Customer with Blue Card 
dfc[(dfc['Card_Category']=='Blue') & (dfc['Attrition_Flag']=='Attrited Customer')].describe(include='all').T
Out[144]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 1519 1 Attrited Customer 1519 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 1519.00000 NaN NaN NaN 46.66228 7.69818 26.00000 41.00000 47.00000 52.00000 68.00000
Gender 1519 2 F 890 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 1519.00000 6.00000 3.00000 456.00000 NaN NaN NaN NaN NaN NaN NaN
Education_Level 1276 6 Graduate 450 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 1399 3 Married 674 NaN NaN NaN NaN NaN NaN NaN
Income_Category 1519 6 Less than $40K 586 NaN NaN NaN NaN NaN NaN NaN
Card_Category 1519 1 Blue 1519 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 1519.00000 NaN NaN NaN 36.20013 7.82959 13.00000 32.00000 36.00000 40.00000 56.00000
Total_Relationship_Count 1519.00000 6.00000 3.00000 386.00000 NaN NaN NaN NaN NaN NaN NaN
Months_Inactive_12_mon 1519.00000 7.00000 3.00000 769.00000 NaN NaN NaN NaN NaN NaN NaN
Contacts_Count_12_mon 1519.00000 7.00000 3.00000 634.00000 NaN NaN NaN NaN NaN NaN NaN
Credit_Limit 1519.00000 NaN NaN NaN 6817.74733 7470.18771 1438.30000 2004.50000 3841.00000 8313.50000 34516.00000
Total_Revolving_Bal 1519.00000 NaN NaN NaN 669.03555 922.33325 0.00000 0.00000 0.00000 1303.50000 2517.00000
Avg_Open_To_Buy 1519.00000 NaN NaN NaN 6148.71178 7500.48659 3.00000 1520.50000 3157.00000 7571.50000 34516.00000
Total_Amt_Chng_Q4_Q1 1519.00000 NaN NaN NaN 0.69230 0.21256 0.00000 0.54400 0.69800 0.85350 1.41100
Total_Trans_Amt 1519.00000 NaN NaN NaN 2954.16458 2180.36497 510.00000 1896.00000 2314.00000 2709.00000 10583.00000
Total_Trans_Ct 1519.00000 NaN NaN NaN 44.22646 14.13832 10.00000 37.00000 43.00000 50.50000 91.00000
Total_Ct_Chng_Q4_Q1 1519.00000 NaN NaN NaN 0.55036 0.22574 0.00000 0.40000 0.52400 0.68700 2.50000
Avg_Utilization_Ratio 1519.00000 NaN NaN NaN 0.17174 0.27103 0.00000 0.00000 0.00000 0.26150 0.99900
Agebin 1519 5 46-55 634 NaN NaN NaN NaN NaN NaN NaN
In [145]:
#Profile of Attrited Customer with gold Card 
dfc[(dfc['Card_Category']=='Gold') & (dfc['Attrition_Flag']=='Attrited Customer')].describe(include='all').T
Out[145]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 21 1 Attrited Customer 21 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 21.00000 NaN NaN NaN 43.85714 6.22323 32.00000 41.00000 44.00000 47.00000 59.00000
Gender 21 2 M 13 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 21.00000 6.00000 2.00000 7.00000 NaN NaN NaN NaN NaN NaN NaN
Education_Level 17 6 Graduate 6 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 17 2 Single 11 NaN NaN NaN NaN NaN NaN NaN
Income_Category 21 6 $60K - $80K 6 NaN NaN NaN NaN NaN NaN NaN
Card_Category 21 1 Gold 21 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 21.00000 NaN NaN NaN 33.90476 6.15552 20.00000 32.00000 36.00000 36.00000 48.00000
Total_Relationship_Count 21.00000 6.00000 2.00000 8.00000 NaN NaN NaN NaN NaN NaN NaN
Months_Inactive_12_mon 21.00000 3.00000 3.00000 14.00000 NaN NaN NaN NaN NaN NaN NaN
Contacts_Count_12_mon 21.00000 6.00000 3.00000 8.00000 NaN NaN NaN NaN NaN NaN NaN
Credit_Limit 21.00000 NaN NaN NaN 29878.52381 7832.30742 15109.00000 23981.00000 34516.00000 34516.00000 34516.00000
Total_Revolving_Bal 21.00000 NaN NaN NaN 1027.00000 1009.46223 0.00000 0.00000 897.00000 1847.00000 2517.00000
Avg_Open_To_Buy 21.00000 NaN NaN NaN 28851.52381 7757.52462 13640.00000 23981.00000 32315.00000 34516.00000 34516.00000
Total_Amt_Chng_Q4_Q1 21.00000 NaN NaN NaN 0.76990 0.25588 0.19600 0.59100 0.85800 0.97800 1.04700
Total_Trans_Amt 21.00000 NaN NaN NaN 5841.80952 2836.81313 1727.00000 2315.00000 6782.00000 8356.00000 9338.00000
Total_Trans_Ct 21.00000 NaN NaN NaN 59.85714 14.40238 34.00000 47.00000 64.00000 68.00000 89.00000
Total_Ct_Chng_Q4_Q1 21.00000 NaN NaN NaN 0.61905 0.20050 0.21400 0.47800 0.65000 0.75000 0.97800
Avg_Utilization_Ratio 21.00000 NaN NaN NaN 0.03629 0.03545 0.00000 0.00000 0.03200 0.06400 0.11900
Agebin 21 4 36-45 9 NaN NaN NaN NaN NaN NaN NaN
In [146]:
#Profile of Attrited Customer with silver  Card 
dfc[(dfc['Card_Category']=='Silver') & (dfc['Attrition_Flag']=='Attrited Customer')].describe(include='all').T
Out[146]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 82 1 Attrited Customer 82 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 82.00000 NaN NaN NaN 47.15854 7.47088 30.00000 42.25000 48.00000 52.00000 65.00000
Gender 82 2 M 54 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 82.00000 6.00000 2.00000 23.00000 NaN NaN NaN NaN NaN NaN NaN
Education_Level 73 6 Graduate 28 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 77 3 Single 43 NaN NaN NaN NaN NaN NaN NaN
Income_Category 82 6 $80K - $120K 22 NaN NaN NaN NaN NaN NaN NaN
Card_Category 82 1 Silver 82 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 82.00000 NaN NaN NaN 36.36585 7.72124 18.00000 33.00000 36.00000 42.75000 56.00000
Total_Relationship_Count 82.00000 6.00000 2.00000 23.00000 NaN NaN NaN NaN NaN NaN NaN
Months_Inactive_12_mon 82.00000 6.00000 3.00000 40.00000 NaN NaN NaN NaN NaN NaN NaN
Contacts_Count_12_mon 82.00000 6.00000 3.00000 37.00000 NaN NaN NaN NaN NaN NaN NaN
Credit_Limit 82.00000 NaN NaN NaN 25960.26829 10054.77810 3735.00000 14890.00000 33092.00000 34516.00000 34516.00000
Total_Revolving_Bal 82.00000 NaN NaN NaN 677.24390 894.66458 0.00000 0.00000 184.00000 1267.75000 2517.00000
Avg_Open_To_Buy 82.00000 NaN NaN NaN 25283.02439 10072.89178 3735.00000 14023.75000 31999.00000 34116.75000 34516.00000
Total_Amt_Chng_Q4_Q1 82.00000 NaN NaN NaN 0.70591 0.24399 0.00000 0.55225 0.73350 0.87675 1.49200
Total_Trans_Amt 82.00000 NaN NaN NaN 4899.68293 3140.82380 691.00000 1929.50000 4753.00000 8144.00000 10294.00000
Total_Trans_Ct 82.00000 NaN NaN NaN 53.28049 17.49728 14.00000 41.00000 50.50000 68.75000 94.00000
Total_Ct_Chng_Q4_Q1 82.00000 NaN NaN NaN 0.61301 0.24940 0.00000 0.43475 0.59100 0.78325 1.21100
Avg_Utilization_Ratio 82.00000 NaN NaN NaN 0.03244 0.05179 0.00000 0.00000 0.00500 0.05350 0.23400
Agebin 82 4 46-55 42 NaN NaN NaN NaN NaN NaN NaN
In [147]:
#Profile of Attrited Customer with platinum Card 
dfc[(dfc['Card_Category']=='Platinum') & (dfc['Attrition_Flag']=='Attrited Customer')].describe(include='all').T
Out[147]:
count unique top freq mean std min 25% 50% 75% max
Attrition_Flag 5 1 Attrited Customer 5 NaN NaN NaN NaN NaN NaN NaN
Customer_Age 5.00000 NaN NaN NaN 49.40000 4.15933 43.00000 48.00000 51.00000 51.00000 54.00000
Gender 5 2 F 4 NaN NaN NaN NaN NaN NaN NaN
Dependent_count 5.00000 4.00000 2.00000 2.00000 NaN NaN NaN NaN NaN NaN NaN
Education_Level 5 3 Graduate 3 NaN NaN NaN NaN NaN NaN NaN
Marital_Status 5 2 Single 3 NaN NaN NaN NaN NaN NaN NaN
Income_Category 5 3 Less than $40K 2 NaN NaN NaN NaN NaN NaN NaN
Card_Category 5 1 Platinum 5 NaN NaN NaN NaN NaN NaN NaN
Months_on_book 5.00000 NaN NaN NaN 36.00000 4.52769 31.00000 32.00000 37.00000 38.00000 42.00000
Total_Relationship_Count 5.00000 3.00000 2.00000 3.00000 NaN NaN NaN NaN NaN NaN NaN
Months_Inactive_12_mon 5.00000 2.00000 3.00000 3.00000 NaN NaN NaN NaN NaN NaN NaN
Contacts_Count_12_mon 5.00000 3.00000 3.00000 2.00000 NaN NaN NaN NaN NaN NaN NaN
Credit_Limit 5.00000 NaN NaN NaN 24997.40000 9281.90682 15987.00000 15987.00000 23981.00000 34516.00000 34516.00000
Total_Revolving_Bal 5.00000 NaN NaN NaN 263.40000 284.58795 0.00000 0.00000 193.00000 531.00000 593.00000
Avg_Open_To_Buy 5.00000 NaN NaN NaN 24734.00000 9212.95080 15794.00000 15987.00000 23388.00000 33985.00000 34516.00000
Total_Amt_Chng_Q4_Q1 5.00000 NaN NaN NaN 0.78480 0.22978 0.43500 0.69500 0.82700 0.98000 0.98700
Total_Trans_Amt 5.00000 NaN NaN NaN 4755.80000 2073.95292 2021.00000 3901.00000 4758.00000 5418.00000 7681.00000
Total_Trans_Ct 5.00000 NaN NaN NaN 60.20000 10.03494 46.00000 54.00000 65.00000 65.00000 71.00000
Total_Ct_Chng_Q4_Q1 5.00000 NaN NaN NaN 0.54560 0.14819 0.39400 0.42100 0.51200 0.69000 0.71100
Avg_Utilization_Ratio 5.00000 NaN NaN NaN 0.01040 0.01064 0.00000 0.00000 0.01200 0.01500 0.02500
Agebin 5 2 46-55 4 NaN NaN NaN NaN NaN NaN NaN

Profile of customers who attrited most, by card type

  • #### Blue Card

    • Most likely female, married, in the 46-55 age group, earning less than $40K, with a graduate education level, 3 dependents, and 3 bank products, and inactive for 3 months. Their average utilization ratio was very low.
  • #### Gold Card

    • Most likely male, single, in the 36-45 age group, earning $60K-$80K, with a graduate education level, and inactive for 3 months.
  • #### Silver Card
    • Most likely male, single, in the 46-55 age group, earning $80K-$120K, with a graduate education level, and inactive for 3 months.
  • #### Platinum Card
    • Most likely female, single, in the 46-55 age group, earning less than $40K, with a graduate education level, and inactive for 3 months.

Insights based on EDA¶

  • ~16% of customers attrited.
  • Female customers attrited more than male customers.
  • Customers who were single attrited more.
  • Customers with a Platinum card attrited at the highest rate, but with only 20 samples this is inconclusive. Customers with a Gold card attrited more than Blue and Silver card holders.
  • Customers in the 36-55 age range attrited more.
  • Customers with a Doctorate or Post-Graduate education attrited the most.
  • Surprisingly, attrition was higher among customers with a higher number of contacts with the bank in the last 12 months.

Outlier Detection¶

In [148]:
Q1 = dfc.quantile(0.25, numeric_only=True)   #25th percentile of each numeric column
Q3 = dfc.quantile(0.75, numeric_only=True)   #75th percentile of each numeric column

IQR = Q3 - Q1                           #Interquartile Range (75th percentile - 25th percentile)

lower=Q1-1.5*IQR                        #Lower and upper bounds for all values; values outside these bounds are outliers
upper=Q3+1.5*IQR
In [149]:
((dfc.select_dtypes(include=['float64','int64'])<lower) | (dfc.select_dtypes(include=['float64','int64'])>upper)).sum()/len(dfc)*100
Out[149]:
Customer_Age            0.01975
Months_on_book          3.81159
Credit_Limit            9.71660
Total_Revolving_Bal     0.00000
Avg_Open_To_Buy         9.50923
Total_Amt_Chng_Q4_Q1    3.91034
Total_Trans_Amt         8.84764
Total_Trans_Ct          0.01975
Total_Ct_Chng_Q4_Q1     3.89059
Avg_Utilization_Ratio   0.00000
dtype: float64
In [150]:
numeric_columns = dfc.select_dtypes('number').columns.to_list()
# outlier detection using boxplot
plt.figure(figsize=(20,30))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4,4,i+1)
    plt.boxplot(dfc[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [151]:
print(upper)
Customer_Age               68.50000
Months_on_book             53.50000
Credit_Limit            23836.25000
Total_Revolving_Bal      3921.50000
Avg_Open_To_Buy         22660.75000
Total_Amt_Chng_Q4_Q1        1.20100
Total_Trans_Amt          8619.25000
Total_Trans_Ct            135.00000
Total_Ct_Chng_Q4_Q1         1.17200
Avg_Utilization_Ratio       1.22300
dtype: float64
In [152]:
dfc[dfc['Credit_Limit'] > upper.Credit_Limit].sort_values(by='Credit_Limit',ascending=False ).count()
Out[152]:
Attrition_Flag              984
Customer_Age                984
Gender                      984
Dependent_count             984
Education_Level             839
Marital_Status              897
Income_Category             984
Card_Category               984
Months_on_book              984
Total_Relationship_Count    984
Months_Inactive_12_mon      984
Contacts_Count_12_mon       984
Credit_Limit                984
Total_Revolving_Bal         984
Avg_Open_To_Buy             984
Total_Amt_Chng_Q4_Q1        984
Total_Trans_Amt             984
Total_Trans_Ct              984
Total_Ct_Chng_Q4_Q1         984
Avg_Utilization_Ratio       984
Agebin                      984
dtype: int64
In [153]:
dfc[dfc['Credit_Limit']== 34516.00000].count() # had seen this number during EDA so verifying
Out[153]:
Attrition_Flag              508
Customer_Age                508
Gender                      508
Dependent_count             508
Education_Level             436
Marital_Status              468
Income_Category             508
Card_Category               508
Months_on_book              508
Total_Relationship_Count    508
Months_Inactive_12_mon      508
Contacts_Count_12_mon       508
Credit_Limit                508
Total_Revolving_Bal         508
Avg_Open_To_Buy             508
Total_Amt_Chng_Q4_Q1        508
Total_Trans_Amt             508
Total_Trans_Ct              508
Total_Ct_Chng_Q4_Q1         508
Avg_Utilization_Ratio       508
Agebin                      508
dtype: int64

508 customers have a credit limit of exactly 34516; this appears to be some default or capped value.

In [154]:
dfc[dfc['Total_Trans_Amt'] > upper.Total_Trans_Amt].sort_values(by='Total_Trans_Amt',ascending=False ).head(10)
Out[154]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Agebin
9964 Existing Customer 47 M 4 NaN Married $60K - $80K Blue 36 4 5 2 10585.00000 1749 8836.00000 0.65500 18484 108 0.58800 0.16500 46-55
10073 Existing Customer 51 M 2 Graduate Married $60K - $80K Blue 40 3 3 3 3750.00000 1801 1949.00000 0.88900 17995 116 0.65700 0.48000 46-55
10097 Existing Customer 31 M 0 High School Single $40K - $60K Blue 25 3 2 3 4493.00000 1388 3105.00000 0.79500 17744 104 0.76300 0.30900 25-35
9601 Existing Customer 45 M 4 High School Single $60K - $80K Blue 35 1 1 3 8449.00000 2092 6357.00000 0.70900 17634 120 0.66700 0.24800 36-45
9341 Existing Customer 48 M 2 High School Married Less than $40K Silver 36 2 2 2 14581.00000 2517 12064.00000 0.77600 17628 109 0.81700 0.17300 46-55
10117 Existing Customer 57 M 2 Graduate Married $80K - $120K Blue 40 6 3 4 17925.00000 1909 16016.00000 0.71200 17498 111 0.82000 0.10600 56-65
10028 Existing Customer 36 F 1 Graduate Single Less than $40K Blue 16 6 5 3 6091.00000 1184 4907.00000 0.76600 17437 113 0.76600 0.19400 36-45
9643 Existing Customer 54 F 2 Graduate Married $40K - $60K Blue 41 3 1 3 7362.00000 1176 6186.00000 0.73500 17390 130 0.68800 0.16000 46-55
9712 Existing Customer 49 M 4 Post-Graduate Single $80K - $120K Blue 42 3 2 1 30885.00000 2018 28867.00000 0.90400 17350 115 0.62000 0.06500 46-55
9645 Existing Customer 35 M 3 Post-Graduate Married $80K - $120K Blue 28 3 1 2 4380.00000 0 4380.00000 0.71900 17258 121 0.70400 0.00000 25-35

896 customers have a transaction amount greater than 8619.25 (the upper bound). Given the corresponding transaction counts, this data appears to be genuine.

In [155]:
dfc[dfc['Avg_Open_To_Buy'] > upper.Avg_Open_To_Buy].sort_values(by='Avg_Open_To_Buy',ascending=False ).head(10)
Out[155]:
Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Agebin
10112 Attrited Customer 33 M 2 College Married $120K + Gold 20 2 1 4 34516.00000 0 34516.00000 1.00400 9338 73 0.62200 0.00000 25-35
9047 Attrited Customer 50 M 1 Post-Graduate NaN $80K - $120K Gold 36 2 3 2 34516.00000 0 34516.00000 1.03200 5547 75 0.74400 0.00000 46-55
2196 Existing Customer 50 M 3 High School Married $120K + Blue 40 5 1 4 34516.00000 0 34516.00000 0.98600 1930 36 0.44000 0.00000 46-55
2201 Attrited Customer 55 F 2 College Single abc Silver 36 2 3 3 34516.00000 0 34516.00000 0.39900 1353 40 0.21200 0.00000 46-55
9127 Existing Customer 56 F 3 Uneducated Single abc Platinum 46 2 3 2 34516.00000 0 34516.00000 0.88700 8416 93 0.63200 0.00000 56-65
2341 Existing Customer 52 M 1 NaN Single $120K + Blue 44 6 1 2 34516.00000 0 34516.00000 1.03000 2848 56 0.75000 0.00000 46-55
9075 Existing Customer 51 M 3 Uneducated Married $80K - $120K Silver 37 1 2 1 34516.00000 0 34516.00000 0.81400 8736 97 0.70200 0.00000 46-55
9073 Existing Customer 39 M 2 NaN Single $80K - $120K Gold 33 1 3 2 34516.00000 0 34516.00000 0.72400 9179 113 0.76600 0.00000 36-45
9068 Attrited Customer 54 F 0 Graduate Single abc Platinum 38 2 2 2 34516.00000 0 34516.00000 0.69500 3901 54 0.42100 0.00000 46-55
9027 Attrited Customer 44 M 4 NaN Married $120K + Blue 36 4 3 3 34516.00000 0 34516.00000 1.04300 5425 60 0.87500 0.00000 36-45

We are not treating outliers here; we want the algorithms to learn from them.

Missing Value Detection & Treatment¶

There are "Unknown" values in the Education_Level, Marital_Status, and Income_Category columns, which can be treated as missing values. We replace "Unknown" with NaN.

In [156]:
dfc = dfc.replace({'Unknown': None})
In [157]:
dfc.isnull().sum()
Out[157]:
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
Agebin                         0
dtype: int64


In [158]:
dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Attrition_Flag            10127 non-null  object 
 1   Customer_Age              10127 non-null  int64  
 2   Gender                    10127 non-null  object 
 3   Dependent_count           10127 non-null  object 
 4   Education_Level           8608 non-null   object 
 5   Marital_Status            9378 non-null   object 
 6   Income_Category           10127 non-null  object 
 7   Card_Category             10127 non-null  object 
 8   Months_on_book            10127 non-null  int64  
 9   Total_Relationship_Count  10127 non-null  object 
 10  Months_Inactive_12_mon    10127 non-null  object 
 11  Contacts_Count_12_mon     10127 non-null  object 
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64  
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64  
 17  Total_Trans_Ct            10127 non-null  int64  
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
 20  Agebin                    10127 non-null  object 
dtypes: float64(5), int64(5), object(11)
memory usage: 1.6+ MB

Missing-Value Treatment¶

  • We will use the KNN imputer to impute missing values. k-Nearest Neighbours (kNN) identifies neighboring points through a distance measure, and missing values are estimated from the completed values of those neighboring observations.
  • KNNImputer: each sample's missing values are imputed using the n_neighbors nearest neighbors found in the training set (default n_neighbors=5).
  • The KNN imputer replaces a missing value with the average of the k nearest non-missing feature values.
  • Nearest points are found using Euclidean distance (computed over non-missing features).
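The mechanics described above can be seen on a toy array (a standalone sketch, separate from the notebook's data; the values are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array: the last row is missing its first feature
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0],
              [np.nan, 5.0]])

# Neighbors are found on the observed (second) feature; the missing entry
# is filled with the mean of the first feature of the 2 nearest rows:
# rows [3.0, 4.0] and [2.0, 3.0], so (3.0 + 2.0) / 2 = 2.5
imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)
print(X_filled[3, 0])  # 2.5
```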
In [159]:
# Label Encode categorical variables  
attrition = {'Existing Customer':0, 'Attrited Customer':1}
dfc['Attrition_Flag']=dfc['Attrition_Flag'].map(attrition)

marital_status = {'Married':1,'Single':2, 'Divorced':3}
dfc['Marital_Status']=dfc['Marital_Status'].map(marital_status)


education = {'Uneducated':1,'High School':2, 'Graduate':3, 'College':4, 'Post-Graduate':5, 'Doctorate':6}
dfc['Education_Level']=dfc['Education_Level'].map(education)

income = {'Less than $40K':1,'$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4, '$120K +':5}
dfc['Income_Category']=dfc['Income_Category'].map(income)
In [160]:
imputer = KNNImputer(n_neighbors=5)
In [161]:
reqd_col_for_impute = ['Income_Category','Education_Level','Marital_Status']

Split the dataset¶

  • Since we have a significant imbalance in the distribution of the target classes, we will use stratified sampling to ensure that relative class frequencies are approximately preserved in train and test sets.
  • For that we will use the stratify parameter in the train_test_split function.
  • Dropping Avg_Open_To_Buy as it is highly correlated with Credit_Limit.
In [162]:
# Separating target column
X = dfc.drop(['Agebin','Attrition_Flag','Avg_Open_To_Buy'],axis=1)
#X = pd.get_dummies(X,drop_first=True)
y = dfc['Attrition_Flag']
In [163]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1,stratify=y)
X_train.shape, X_test.shape
Out[163]:
((7088, 18), (3039, 18))
In [164]:
#Fit and transform the train data
X_train[reqd_col_for_impute]=imputer.fit_transform(X_train[reqd_col_for_impute])

#Transform the test data 
X_test[reqd_col_for_impute]=imputer.transform(X_test[reqd_col_for_impute])
In [165]:
#Checking that no column has missing values in train or test sets
print(X_train.isnull().sum())
print('-'*30)
print(X_test.isnull().sum())
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------
Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
In [166]:
## Function to inverse the encoding
def inverse_mapping(x,y):
    inv_dict = {v: k for k, v in x.items()}
    X_train[y] = np.round(X_train[y]).map(inv_dict).astype('category')
    X_test[y] = np.round(X_test[y]).map(inv_dict).astype('category')
In [167]:
inverse_mapping(education,'Education_Level')
inverse_mapping(marital_status,'Marital_Status')
inverse_mapping(income,'Income_Category')

Encoding categorical variables¶

In [168]:
X_train=pd.get_dummies(X_train,drop_first=True)
X_test=pd.get_dummies(X_test,drop_first=True)
print(X_train.shape, X_test.shape)
(7088, 46) (3039, 46)
In [169]:
X_train
Out[169]:
Customer_Age Months_on_book Credit_Limit Total_Revolving_Bal Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_M Dependent_count_1 Dependent_count_2 ... Months_Inactive_12_mon_1 Months_Inactive_12_mon_2 Months_Inactive_12_mon_3 Months_Inactive_12_mon_4 Months_Inactive_12_mon_5 Months_Inactive_12_mon_6 Contacts_Count_12_mon_1 Contacts_Count_12_mon_2 Contacts_Count_12_mon_3 Contacts_Count_12_mon_4 Contacts_Count_12_mon_5 Contacts_Count_12_mon_6
4124 50 43 7985.00000 0 1.03200 3873 72 0.67400 0.00000 0 1 0 ... 1 0 0 0 0 0 0 1 0 0 0 0
4686 50 36 5444.00000 2499 0.46800 4509 80 0.66700 0.45900 1 0 0 ... 0 0 1 0 0 0 0 1 0 0 0 0
1276 26 13 1643.00000 1101 0.71300 2152 50 0.47100 0.67000 0 0 0 ... 0 0 1 0 0 0 0 0 0 1 0 0
6119 65 55 2022.00000 0 0.57900 4623 65 0.54800 0.00000 0 0 0 ... 0 0 1 0 0 0 0 0 0 0 0 0
2253 46 35 4930.00000 0 1.01900 3343 77 0.63800 0.00000 1 0 0 ... 0 0 1 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4581 50 36 1438.30000 0 0.65800 2329 43 0.59300 0.00000 0 0 1 ... 0 1 0 0 0 0 0 1 0 0 0 0
9338 56 47 7204.00000 0 0.80300 14042 113 0.76600 0.00000 1 0 0 ... 1 0 0 0 0 0 0 1 0 0 0 0
9950 45 36 34516.00000 0 0.73200 8603 84 0.61500 0.00000 1 0 0 ... 0 1 0 0 0 0 1 0 0 0 0 0
1784 35 29 34516.00000 1965 1.04400 2949 70 1.00000 0.05700 1 0 0 ... 0 1 0 0 0 0 0 0 1 0 0 0
4752 41 32 3189.00000 0 0.91000 4813 86 0.75500 0.00000 0 0 0 ... 0 0 1 0 0 0 1 0 0 0 0 0

7088 rows × 46 columns

In [170]:
# Defining empty lists to collect train and test results
model_name=[]
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

def make_confusion_matrix(y_actual,y_predict,title):
    '''Plot confusion matrix'''
    fig, ax = plt.subplots(1, 1)
    
    cm = confusion_matrix(y_actual, y_predict, labels=[0,1])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                               display_labels=["No","Yes"])
    disp.plot(cmap='Blues',ax=ax)
    
    ax.set_title(title)
    plt.tick_params(axis='both', which='both', length=0)
    plt.grid(visible=False)
    plt.show()
In [171]:
def get_metrics_score(model,modelname,X_train_pass,X_test_df_pass,y_train_pass,y_test_pass):
    '''
    Calculate metric scores for the model: Accuracy, Recall, Precision, and F1 score.
    model: fitted classifier used to predict
    X_train_pass, X_test_df_pass: independent features
    y_train_pass, y_test_pass: dependent variable
    '''
    # defining an empty list to store train and test results
    score_list=[]
    
    pred_train = model.predict(X_train_pass)
    pred_test = model.predict(X_test_df_pass)
    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)
    train_acc = accuracy_score(y_train_pass,pred_train)
    test_acc = accuracy_score(y_test_pass,pred_test)
    train_recall = recall_score(y_train_pass,pred_train)
    test_recall = recall_score(y_test_pass,pred_test)
    train_precision = precision_score(y_train_pass,pred_train)
    test_precision = precision_score(y_test_pass,pred_test)
    train_f1 = f1_score(y_train_pass,pred_train)
    test_f1 = f1_score(y_test_pass,pred_test)
    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision,train_f1,test_f1))
    model_name.append(modelname)  
    acc_train.append(score_list[0])
    acc_test.append(score_list[1])
    recall_train.append(score_list[2])
    recall_test.append(score_list[3])
    precision_train.append(score_list[4])
    precision_test.append(score_list[5])
    f1_train.append(score_list[6])
    f1_test.append(score_list[7])
    metric_names = ['Train_Accuracy', 'Test_Accuracy', 'Train_Recall', 'Test_Recall','Train_Precision',
                          'Test_Precision', 'Train_F1-Score', 'Test_F1-Score']
    cols = ['Metric', 'Score']
    records = [(name, score) for name, score in zip(metric_names, score_list)]
    display(pd.DataFrame.from_records(records, columns=cols, index='Metric').T)
    # display confusion matrix
    make_confusion_matrix(y_train_pass,pred_train,"Confusion Matrix for Train")     
    make_confusion_matrix(y_test_pass,pred_test,"Confusion Matrix for Test") 
    return score_list # returning the list with train and test scores

Model Building

Model evaluation criterion:¶

Model can make wrong predictions as:¶

  1. Predicting a customer will churn but he does not - Loss of resources
  2. Predicting a customer will not churn the services but he does - Loss of income

Which case is more important?¶

  • Predicting that a customer will not churn when he actually does, i.e. losing a potential source of income for the bank. The bank can take actions to stop these customers from churning.

How to reduce this loss i.e need to reduce False Negatives?¶

  • The bank wants Recall to be maximized: the greater the Recall, the lower the chance of false negatives, i.e. of predicting that customers will not churn when in reality they do.
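As a quick illustration with hypothetical labels, recall is TP / (TP + FN), so every missed churner (false negative) directly lowers it:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = attrited, 0 = existing
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]   # one churner missed (FN), one false alarm (FP)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 3 true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1 false negative
print(tp / (tp + fn))                 # 0.75
print(recall_score(y_true, y_pred))   # 0.75
```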

Model Building Logistic Regression¶

In [172]:
#Initialize model using pipeline
pipe_lr = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

#Fit on train data
pipe_lr.fit(X_train,y_train)
Out[172]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=1))])
In [173]:
lr_score=get_metrics_score(pipe_lr,'LogisticRegression',X_train,X_test,y_train,y_test)
Metric Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1-Score Test_F1-Score
Score 0.91366 0.91214 0.62687 0.63730 0.79245 0.77556 0.70000 0.69966

Let's evaluate the model performance using KFold and cross_val_score.

K-Fold cross-validation splits the dataset into k consecutive folds (stratified here, without shuffling by default). Each fold is used once as a validation set while the remaining k-1 folds form the training set.

In [174]:
#Evaluate the model performance by using KFold and cross_val_score
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)     #Setting number of splits equal to 5
lr_cv_result=cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)

#Plotting boxplots for CV scores of model defined above
plt.boxplot(lr_cv_result)
plt.show()
  • Recall on the training set ranges from about 0.58 to 0.66 across folds, with an average of about 0.61.

Handling Imbalanced dataset

This is an imbalanced dataset. A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary.

One way to address this is to oversample the minority class. The simplest approach duplicates examples from the minority class in the training dataset prior to fitting a model; this balances the class distribution but provides no additional information to the model. Instead, new examples can be synthesized from existing ones. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

Over Sampling¶

Since the dataset is imbalanced, let us try oversampling with SMOTE and see whether performance improves.

In [175]:
print(f"Before UpSampling, counts of label attrited customer: {sum(y_train==1)}")
print(f"Before UpSampling, counts of label existing customer: {sum(y_train==0)} \n")

sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)   #Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train.ravel())

print(f"After UpSampling, counts of label attrited customer: {sum(y_train_over==1)}")
print(f"After UpSampling, counts of label existing customer: {sum(y_train_over==0)} \n")

print(f'After UpSampling, the shape of train_X: {X_train_over.shape}')
print(f'After UpSampling, the shape of train_y: {y_train_over.shape} \n')
Before UpSampling, counts of label attrited customer: 1139
Before UpSampling, counts of label existing customer: 5949 

After UpSampling, counts of label attrited customer: 5949
After UpSampling, counts of label existing customer: 5949 

After UpSampling, the shape of train_X: (11898, 46)
After UpSampling, the shape of train_y: (11898,) 

In [176]:
lr_over = LogisticRegression(solver='liblinear')
lr_over.fit(X_train_over, y_train_over)
Out[176]:
LogisticRegression(solver='liblinear')
In [177]:
lr_score_over=get_metrics_score(lr_over,'LogisticRegression with over sampling',X_train_over,X_test,y_train_over,y_test)
Metric Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1-Score Test_F1-Score
Score 0.91461 0.85620 0.88939 0.48975 0.93663 0.55972 0.91240 0.52240

Recall on the test data is only ~0.49, and the model is overfitting: there is a large discrepancy between the train and test scores. Let's try regularization.

What is Regularization ?

A linear model works by selecting coefficients for each independent variable that minimize a loss function. However, if the coefficients are large, they can lead to overfitting on the training dataset, and such a model will not generalize well to unseen test data. This is where regularization helps. Regularization shrinks the coefficients towards zero; in simple words, it discourages learning a more complex or flexible model, to prevent overfitting.

Main Regularization Techniques

Ridge Regression (L2 Regularization)

Ridge regression adds “squared magnitude” of coefficient as penalty term to the loss function.

Lasso Regression (L1 Regularizaion)

Lasso adds the "absolute value of magnitude" of coefficients as a penalty term to the loss function.

Elastic Net Regression

Elastic net regression combines the properties of ridge and lasso regression. It works by penalizing the model using both the l2-norm and the l1-norm.

Elastic Net Formula: Ridge + Lasso
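The grid search below tries only l1 and l2 penalties; as a sketch, an elastic-net penalty could also be tried in scikit-learn, which supports it only with the 'saga' solver. The toy data and hyperparameter values here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical demo data, standing in for the notebook's train set
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

# l1_ratio mixes the penalties: 0 = pure ridge (l2), 1 = pure lasso (l1)
pipe_en = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       l1_ratio=0.5, C=0.1, max_iter=5000, random_state=1),
)
pipe_en.fit(X_demo, y_demo)
print(pipe_en.score(X_demo, y_demo))  # training accuracy of the demo fit
```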

Regularization on Oversampled dataset¶

In [178]:
# Choose the type of classifier. 
pipe_lr_reg = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Grid of parameters to choose from
parameters = {'logisticregression__C': np.arange(0.007,0.5,0.01),
              'logisticregression__solver' : ['liblinear','newton-cg','lbfgs','sag','saga'],
              'logisticregression__penalty': ['l1','l2']
             }

# Run the grid search
grid_obj = RandomizedSearchCV(pipe_lr_reg, parameters, scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train_over, y_train_over)

# Set the clf to the best combination of parameters
pipe_lr_reg = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
pipe_lr_reg.fit(X_train_over, y_train_over)
/Users/rcs713/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
Out[178]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(C=0.007, penalty='l1', random_state=1,
                                    solver='saga'))])
In [179]:
lr_score_reg_over=get_metrics_score(pipe_lr_reg,'LogisticRegression with Regularization on Over sampling',X_train_over,X_test,y_train_over,y_test)
Metric Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1-Score Test_F1-Score
Score 0.93696 0.90260 0.92100 0.63525 0.95138 0.72430 0.93594 0.67686

The recall on test data has improved. Let's see if undersampling can improve it further.

Undersampling¶

Let's try undersampling and see if the performance differs.

In [180]:
rus = RandomUnderSampler(random_state = 1)  # Undersample the majority class to balance the target
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Attrited': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train==0)))

print("After Under Sampling, counts of label 'Attrited': {}".format(sum(y_train_under==1)))
print("After Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train_under==0)))

print('After Under Sampling, the shape of train_X: {}'.format(X_train_under.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_under.shape))
                                          
Before Under Sampling, counts of label 'Attrited': 1139
Before Under Sampling, counts of label 'Existing': 5949 

After Under Sampling, counts of label 'Attrited': 1139
After Under Sampling, counts of label 'Existing': 1139 

After Under Sampling, the shape of train_X: (2278, 46)
After Under Sampling, the shape of train_y: (2278,) 

Logistic Regression on undersampled data¶

In [181]:
# Initialize model using pipeline
pipe_lr_under = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Training the basic logistic regression model with training set 
pipe_lr_under.fit(X_train_under,y_train_under)
Out[181]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(random_state=1))])
In [182]:
#Evaluate the model performance by using KFold and cross_val_score
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)     #Setting number of splits equal to 5
cv_result_under=cross_val_score(estimator=pipe_lr_under, X=X_train_under, y=y_train_under, scoring=scoring, cv=kfold)

#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_under)
plt.show()
In [183]:
lr_score_under=get_metrics_score(pipe_lr_under,'LogisticRegression with under sampling',X_train_under,X_test,y_train_under,y_test)
Metric Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1-Score Test_F1-Score
Score 0.85777 0.84962 0.85777 0.84016 0.85777 0.51965 0.85777 0.64213

Observation¶

  • The model after undersampling generalizes well on the training and test sets. Test recall after undersampling (0.84) was better than test recall after oversampling. Let's try regularization, searching over all the solvers and both penalties.
In [184]:
# Choose the type of classifier. 
pipe_lr_reg_under = make_pipeline( StandardScaler(), (LogisticRegression(random_state=1)))

# Grid of parameters to choose from
parameters = {'logisticregression__C': np.arange(0.007,0.5,0.01),
              'logisticregression__solver' : ['liblinear','newton-cg','lbfgs','sag','saga'],
              # Note: the 'l1' penalty is only supported by 'liblinear' and 'saga';
              # other solver/penalty combinations fail to fit and are skipped
              'logisticregression__penalty': ['l1','l2']
             }

# Run the grid search
grid_obj = RandomizedSearchCV(pipe_lr_reg_under, parameters, scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train_under, y_train_under)

# Set the clf to the best combination of parameters
pipe_lr_reg_under = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
pipe_lr_reg_under.fit(X_train_under, y_train_under)
/Users/rcs713/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_sag.py:350: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn(
Out[184]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression',
                 LogisticRegression(C=0.4870000000000001, penalty='l1',
                                    random_state=1, solver='saga'))])
In [185]:
lr_score_reg=get_metrics_score(pipe_lr_reg_under,'LogisticRegression with Regularization on Undersampled',X_train_under,X_test,y_train_under,y_test)
Metric Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1-Score Test_F1-Score
Score 0.85558 0.84798 0.85689 0.84016 0.85464 0.51637 0.85577 0.63963

Model Performance Evaluation and Improvement - Logistic Regression¶

In [186]:
comparison_frame = pd.DataFrame({'Model':model_name,
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_F1':f1_train,
                                          'Test_F1':f1_test  }) 

#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)
Out[186]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision Train_F1 Test_F1
3 LogisticRegression with under sampling 0.85777 0.84962 0.85777 0.84016 0.85777 0.51965 0.85777 0.64213
4 LogisticRegression with Regularization on Undersampled 0.85558 0.84798 0.85689 0.84016 0.85464 0.51637 0.85577 0.63963
0 LogisticRegression 0.91366 0.91214 0.62687 0.63730 0.79245 0.77556 0.70000 0.69966
2 LogisticRegression with Regularization on Over sampling 0.93696 0.90260 0.92100 0.63525 0.95138 0.72430 0.93594 0.67686
1 LogisticRegression with over sampling 0.91461 0.85620 0.88939 0.48975 0.93663 0.55972 0.91240 0.52240

Logistic Regression with undersampling gives a well-generalized model and the best test recall (0.84).

Model Building - Decision Tree, Bagging and Boosting¶

Here I build different models with pipelines, evaluating them using StratifiedKFold and cross_val_score, and will then tune the best 3 models using GridSearchCV and RandomizedSearchCV.

Stratified K-Folds cross-validation provides train/validation indices to split the data into k folds while keeping the class distribution in each fold the same as in the target variable. Each fold is used once as validation while the remaining k - 1 folds form the training set.
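A minimal sketch of this fold-wise class preservation, using a hypothetical imbalanced target that mirrors the ~84/16 Existing-vs-Attrited split of the training set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80 negatives, 20 positives (hypothetical data)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # features are irrelevant to the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps the 80/20 class ratio of the full target:
    # exactly 4 positives out of 20 samples per fold
    print(f"Fold {fold}: positives in val = {y[val_idx].sum()} of {len(val_idx)}")
```

A plain KFold, by contrast, could put most of the minority class into one or two folds, making the cross-validated recall estimates unstable.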

In [187]:
models = []  # Empty list to store all the models

# Appending pipelines for each model into the list
models.append(
    (
        "DTREE",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("decision_tree", DecisionTreeClassifier(random_state=1)),
            ]
        ),
    )
)

models.append(
    (
        "RF",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("random_forest", RandomForestClassifier(random_state=1)),
            ]
        ),
    )
)

models.append(
    (
        "BG",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("bagging", BaggingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "GBM",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "ADB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("adaboost", AdaBoostClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "XGB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
            ]
        ),
    )
)


results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
DTREE: 77.78537754076822
RF: 72.78112682587526
BG: 79.54246850606694
GBM: 82.61341680191669
ADB: 80.77053868150553
XGB: 85.77672153953164
In [188]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()
  • We can see that XGBoost gives the highest cross-validated recall, with just one outlier, followed by Gradient Boosting and AdaBoost. The bagging classifier had a maximum recall of ~84 but a minimum of ~74, pulling its mean down to only ~79, so I did not choose the bagging classifier.
  • The three best performing models are XGBoost, Gradient Boosting, and AdaBoost.
  • We will tune our 3 best models to see if the performance improves after tuning.

Hyperparameter Tuning¶

We will use pipelines with StandardScaler and classifiers model and tune the model using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods - grid search and randomized search.

Random Search: define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.

Grid Search: define a search space as a grid of hyperparameter values and evaluate every position in the grid.

We can also use the make_pipeline function instead of Pipeline to create a pipeline.

make_pipeline: This is shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names are set automatically to the lowercase of their class names.
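A quick sketch of the naming difference, which matters because the auto-generated step name is the prefix used in the parameter grids below (e.g. `adaboostclassifier__n_estimators`):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline names each step after its class, lowercased
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=1))
print([name for name, _ in pipe.steps])  # ['standardscaler', 'logisticregression']

# The explicit Pipeline constructor requires you to choose the names yourself;
# grid keys would then use your names, e.g. 'clf__C' instead of 'logisticregression__C'
pipe_explicit = Pipeline(steps=[("scaler", StandardScaler()),
                                ("clf", LogisticRegression(random_state=1))])
```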

Adaboost Using Grid Search¶

In [ ]:
%%time
# Creating pipeline
pipe_ada_grid = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

# Parameter grid to pass in GridSearchCV
param_grid = {
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
pipe_ada_grid = GridSearchCV(estimator=pipe_ada_grid, param_grid=param_grid, scoring=scorer, cv=5,n_jobs = -1)

# Fitting parameters with train data in GridSearchCV
pipe_ada_grid.fit(X_train, y_train)
                              
print("Best parameters are {} with CV score={}:" .format(pipe_ada_grid.best_params_,pipe_ada_grid.best_score_))
In [ ]:
# Creating new pipeline with best parameters
abc_tuned_grid = make_pipeline(
    StandardScaler(),AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2, 
                                                                              random_state=1),
                                        learning_rate=1, n_estimators=70))

# Fit the model on training data
abc_tuned_grid.fit(X_train, y_train)
In [ ]:
abc_tuned_score=get_metrics_score(abc_tuned_grid,' Adaboost with Grid Search',X_train,X_test,y_train,y_test)
In [ ]:
feature_names = X_train.columns
importances = abc_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • The test recall has increased by ~6% compared to the cross-validated recall.
  • The model is well generalized; let's see if random search gives a different result.

Adaboost Using Random Search¶

In [ ]:
%%time

# Creating pipeline
pipe_ada_ran = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "adaboostclassifier__base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
abc_rand_cv = RandomizedSearchCV(estimator=pipe_ada_ran, param_distributions=param_grid, n_iter=10,n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
abc_rand_cv.fit(X_train,y_train)


print("Best parameters are {} with CV score={}:" .format(abc_rand_cv.best_params_,abc_rand_cv.best_score_))
In [ ]:
# Creating new pipeline with best parameters
abc_tuned_rand = make_pipeline(
    StandardScaler(),AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                                              random_state=1),
                                        learning_rate=1, n_estimators=90))

# Fit the model on training data
abc_tuned_rand.fit(X_train, y_train)
In [ ]:
abc_rand_tuned_score=get_metrics_score(abc_tuned_rand,' Adaboost with Random Search',X_train,X_test,y_train,y_test)
  • Here random search took less time, and recall improved with it; false negative cases were reduced.
  • Grid search took a significantly longer time than random search.
In [ ]:
feature_names = X_train.columns
importances = abc_tuned_rand[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

GradientBoosting with Grid Search¶

In [ ]:
%%time
# Creating pipeline
pipe_gb_grid = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))

# Grid of parameters to choose from
param_grid = {'gradientboostingclassifier__n_estimators':[100,200],
              'gradientboostingclassifier__max_depth':[10,20],
              'gradientboostingclassifier__min_samples_leaf': [10,20],
              'gradientboostingclassifier__min_samples_split': [25,35]
              }
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_cv = GridSearchCV(pipe_gb_grid, param_grid, scoring=scorer,cv=5,n_jobs = -1)

# Fitting parameters in GridSeachCV
pipe_gb_grid = grid_cv.fit(X_train, y_train)


print("Best parameters are {} with CV score={}:" .format(pipe_gb_grid.best_params_,grid_cv.best_score_))
In [ ]:
# Creating new pipeline with best parameters
gb_tuned_grid = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(max_depth=20,
                                            min_samples_leaf=20,
                                            min_samples_split=25,
                                            n_estimators=200, random_state=1
                                            ))

# Fit the model on training data
gb_tuned_grid.fit(X_train, y_train)
In [ ]:
gb_tuned_score=get_metrics_score(gb_tuned_grid,' Gradient with Grid Search',X_train,X_test,y_train,y_test)
In [ ]:
feature_names = X_train.columns
importances = gb_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Grid search gave a better recall than the earlier cross-validation.
  • The model is overfitting. Let's see how randomized search performs.

GradientBoosting with Random Search¶

In [ ]:
%%time 
pipe_gb_rand = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=1))

param_grid = {'gradientboostingclassifier__n_estimators':[100,200],
              'gradientboostingclassifier__max_depth':[10,20],
              'gradientboostingclassifier__min_samples_leaf': [10,20],
              'gradientboostingclassifier__min_samples_split': [25,35]
              }


scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
pipe_gb_rand = RandomizedSearchCV(estimator=pipe_gb_rand, param_distributions=param_grid,n_jobs = -1, n_iter=10, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
pipe_gb_rand.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(pipe_gb_rand.best_params_,pipe_gb_rand.best_score_))
In [ ]:
gb_tuned_rand = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(max_depth=20, min_samples_leaf=20,
                                            min_samples_split=25,
                                            n_estimators=200,
                                            random_state=1))

# Fit the model on training data
gb_tuned_rand.fit(X_train, y_train)
In [ ]:
gb_rand_tuned_score=get_metrics_score(gb_tuned_rand,' Gradient boosting with Random Search',X_train,X_test,y_train,y_test)
  • Grid search took a significantly longer time than random search.
  • Both models perform the same: they have the same recall, and both are overfitting.
In [ ]:
feature_names = X_train.columns
importances = gb_tuned_rand[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

XGBclassifier with Grid Search¶

In [ ]:
%%time 

#Creating pipeline
pipe_xgboost=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))

#Parameter grid to pass in GridSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[2,10],
            'xgbclassifier__learning_rate':[0.01,0.1,0.2], 
            'xgbclassifier__subsample':[0.7,1]}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling GridSearchCV
Xgboost_grid_cv = GridSearchCV(estimator=pipe_xgboost, param_grid=param_grid, scoring=scorer, cv=5, n_jobs = -1)

#Fitting parameters in GridSeachCV
Xgboost_grid_cv.fit(X_train,y_train)


print("Best parameters are {} with CV score={}:" .format(Xgboost_grid_cv.best_params_,Xgboost_grid_cv.best_score_))
In [ ]:
# Creating new pipeline with best parameters
xgb_tuned_grid = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        random_state=1,
        n_estimators=150,
        scale_pos_weight=10,
        subsample=1,
        learning_rate=0.01,
        eval_metric='logloss',
    ),
)

# Fit the model on training data
xgb_tuned_grid.fit(X_train, y_train)
In [ ]:
xgb_tuned_score_grid=get_metrics_score(xgb_tuned_grid,' XGboost with Grid Search',X_train,X_test,y_train,y_test)
  • Recall has improved by ~9% with grid search hyperparameter tuning.
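A side note on the `scale_pos_weight` values searched above: in XGBoost this parameter up-weights the positive (minority) class, and a common heuristic is n_negative / n_positive. A minimal sketch using the train-set counts printed in the undersampling cell earlier:

```python
# scale_pos_weight heuristic: ratio of negative to positive training samples.
n_existing, n_attrited = 5949, 1139  # counts from the training set above
heuristic = n_existing / n_attrited
print(round(heuristic, 2))  # ~5.22; the grid tries the coarser values 2 and 10
```

The grid's value of 10 over-weights attrited customers beyond this heuristic, which pushes recall up at the cost of precision, consistent with the results observed here.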

XGBoost Using Random Search¶

In [ ]:
%%time 
#Creating pipeline

pipe_xgboost_ran=make_pipeline(StandardScaler(), XGBClassifier(random_state=1,eval_metric='logloss'))

#Parameter grid to pass in RandomizedSearchCV
param_grid={'xgbclassifier__n_estimators':np.arange(50,300,50),'xgbclassifier__scale_pos_weight':[2,10],
            'xgbclassifier__learning_rate':[0.01,0.1,0.2], 
            'xgbclassifier__subsample':[0.7,1]}


# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe_xgboost_ran, param_distributions=param_grid,n_jobs = -1, n_iter=10, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
In [ ]:
# Creating new pipeline with best parameters
xgb_rand = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        random_state=1,
        n_estimators=50,
        scale_pos_weight=10,
        subsample=0.7,
        learning_rate=0.01,
        eval_metric='logloss',
    ),
)

# Fit the model on training data
xgb_rand.fit(X_train, y_train)
In [ ]:
randomized_cv_tuned_score=get_metrics_score(xgb_rand,'XG boosting with Random Search',X_train,X_test,y_train,y_test)
  • Random search took less time, but its recall was not better than grid search's.
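The time difference follows directly from the number of candidates: grid search fits every combination, while random search fits only `n_iter` sampled ones. A sketch counting the candidates in the XGBoost grid above (parameter names shortened, without the `xgbclassifier__` prefix):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same value ranges as the XGBoost search above: 5 * 2 * 3 * 2 combinations
param_grid = {'n_estimators': list(np.arange(50, 300, 50)),
              'scale_pos_weight': [2, 10],
              'learning_rate': [0.01, 0.1, 0.2],
              'subsample': [0.7, 1]}

n_candidates = len(list(ParameterGrid(param_grid)))
print(n_candidates)  # 60 candidates, each fit 5 times (cv=5) by GridSearchCV
# RandomizedSearchCV with n_iter=10 fits only 10 of these 60 candidates
```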

Comparing all models¶

In [ ]:
comparison_frame = pd.DataFrame({'Model':model_name,
                                          'Train_Accuracy': acc_train,'Test_Accuracy': acc_test,
                                          'Train_Recall':recall_train,'Test_Recall':recall_test,
                                          'Train_Precision':precision_train,'Test_Precision':precision_test,
                                          'Train_F1':f1_train,
                                          'Test_F1':f1_test  }) 

#Sorting models in decreasing order of test recall
comparison_frame.sort_values(by='Test_Recall',ascending=False)
  • Logistic Regression with oversampling performed very poorly on test data; the recall was only ~0.49.
  • The XGBoost model tuned using grid search gives the best test recall of 0.95: it correctly identifies ~95% of the customers who will attrite. Precision, however, is very low for this model.
  • Random search took less time than grid search, but that does not necessarily mean better performance. Random search is faster because not all parameter values are tried; instead, a fixed number of parameter settings (given by n_iter) is sampled from the specified distributions, rather than searching every combination. Random search therefore does not guarantee finding the best set of hyperparameters; grid search did slightly better for XGBoost and AdaBoost.
  • Performance can vary with the range of hyperparameters searched, and with more hyperparameters grid search will take even longer.
  • Let's see the feature importance from the tuned XGBoost model.
In [ ]:
feature_names = X_train.columns
importances = xgb_tuned_grid[1].feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Total transaction count is the most important feature, followed by total revolving balance and total transaction amount.

Conclusion¶

  • Random search takes less time to choose the best parameters, but that does not guarantee it will perform better.
  • Different hyperparameters can be tried to further improve some of the models.
  • XGBoost with grid search performed best.
  • Total transaction count is the most important feature, followed by total revolving balance and total transaction amount.
  • A lower transaction count, lower revolving balance, and lower transaction amount are indications that a customer will attrite.

Customer Profile¶

Customers with 3 or more of the following traits are most likely to attrite and should be targeted:

  • Age 36-55
  • Holds fewer than 3 banking products
  • Has a higher level of education
  • Has been inactive

Business Recommendations & Insights¶

  • A lower transaction count on the credit card, a lower revolving balance, and a lower transaction amount indicate that a customer will attrite. A low transaction count means the customer is not using the credit card; the bank should offer more rewards, cashback, or other incentives to encourage usage.
  • The analysis shows that customers who hold more products are less likely to attrite. The bank should prioritize cross-selling additional products to at-risk customers, which will help retain them.
  • Customers who have been inactive for a month show a high chance of attrition; these customers should also be a focus.
  • The average utilization ratio is lower amongst attrited customers.
  • Per the EDA, customers in the 36-55 age range, those with doctorate or post-graduate degrees, and female customers attrited more. One possible reason is that a competing bank is offering them better deals, leading to less use of this bank's credit card.
  • Per the EDA, customers who had a high number of contacts with the bank in the last 12 months attrited. It should be investigated whether unresolved customer issues led to these customers leaving the bank.